Abstract

We consider here estimation of an unknown probability density $s$ belonging to $\mathbb L_2(\mu)$ where $\mu$ is a probability measure. We have at hand $n$ i.i.d. observations with density $s$ and use the squared $\mathbb L_2$-norm as our loss function. The purpose of this paper is to provide an abstract but completely general method for estimating $s$ by model selection, which makes it possible to handle arbitrary families of finite-dimensional (possibly non-linear) models and any $s \in \mathbb L_2(\mu)$. We shall, in particular, consider the cases of unbounded densities and bounded densities with unknown $\mathbb L_\infty$-norm and investigate how the $\mathbb L_\infty$-norm of $s$ may influence the risk. We shall also provide applications to adaptive estimation and aggregation of preliminary estimators. Although of a purely theoretical nature, our method leads to results that cannot presently be reached by more concrete ones.

1 Introduction



1.1 Histograms and partition selection 

Suppose we have at hand $n$ i.i.d. observations $X_1, \dots, X_n$ with values in the measurable space $(\mathcal X, \mathcal W)$ and they have an unknown density $s$ with respect to some probability measure $\mu$ on $\mathcal X$. The simplest method for finding an estimator of $s$ is to build a histogram. Given a finite partition $\mathcal I = \{I_1, \dots, I_k\}$ of $\mathcal X$ with $\mu(I_j) = l_j > 0$ for $1 \le j \le k$, the histogram $\hat s_{\mathcal I}$ based on this partition is defined by

$$\hat s_{\mathcal I}(X_1, \dots, X_n) = \frac{1}{n} \sum_{j=1}^k \frac{N_j}{l_j} 1_{I_j}, \quad \text{with } N_j = \sum_{i=1}^n 1_{I_j}(X_i). \tag{1.1}$$
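Let $p_j = \int_{I_j} s \, d\mu$,

$$\bar s_{\mathcal I} = \sum_{j=1}^k \frac{p_j}{l_j} 1_{I_j} \quad \text{and} \quad S_{\mathcal I} = \Big\{ \sum_{j=1}^k \beta_j 1_{I_j}, \ \beta_j \in \mathbb R \text{ for } 1 \le j \le k \Big\}.$$

If $s \in \mathbb L_2(\mu)$, then $\bar s_{\mathcal I}$ is the orthogonal projection of $s$ onto the $k$-dimensional linear space $S_{\mathcal I}$ spanned by the functions $1_{I_j}$. Choosing the squared $\mathbb L_2$-distance induced by the norm $\| \cdot \|$ of $\mathbb L_2(\mu)$ as our loss function leads to the following quadratic risk for the estimator $\hat s_{\mathcal I}$:

$$\mathbb E \big[ \| \hat s_{\mathcal I} - s \|^2 \big] = \| \bar s_{\mathcal I} - s \|^2 + \frac{1}{n} \sum_{j=1}^k \frac{p_j (1 - p_j)}{l_j}. \tag{1.2}$$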



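Hence, if $s \in \mathbb L_\infty(\mu)$, with norm $\|s\|_\infty$, the quadratic risk of $\hat s_{\mathcal I}$ can be bounded by

$$\mathbb E \big[ \| \hat s_{\mathcal I} - s \|^2 \big] \le \| \bar s_{\mathcal I} - s \|^2 + \frac{(k-1) \|s\|_\infty}{n} \tag{1.3}$$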



and, as we shall see below, this bound is essentially unimprovable without additional 
assumptions. 

The histogram estimator $\hat s_{\mathcal I}$ is probably the simplest example of a model-based estimator with model $S_{\mathcal I}$, i.e. an estimator of $s$ with values in $S_{\mathcal I}$. It may actually be viewed as the empirical counterpart of the projection $\bar s_{\mathcal I}$ of $s$ onto $S_{\mathcal I}$.
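As an illustration of (1.1) and of the variance term in (1.2), the following minimal Python sketch computes histogram heights on $[0,1]$ under Lebesgue measure. The function names `histogram_heights` and `variance_term`, and the restriction to $\mathcal X = [0,1]$ with interval cells, are assumptions made for the example and do not come from the paper.

```python
import bisect


def histogram_heights(sample, breakpoints):
    """Heights N_j / (n * l_j) of the histogram (1.1) on [0, 1].

    breakpoints: increasing interior endpoints; they define
    k = len(breakpoints) + 1 intervals [a, b), the last one closed at 1.
    """
    edges = [0.0] + list(breakpoints) + [1.0]
    counts = [0] * (len(edges) - 1)
    for x in sample:
        # index of the cell containing x; x == 1.0 goes to the last cell
        j = min(bisect.bisect_right(edges, x) - 1, len(counts) - 1)
        counts[j] += 1
    n = len(sample)
    return [c / (n * (edges[j + 1] - edges[j])) for j, c in enumerate(counts)]


def variance_term(p, l, n):
    """Second term of (1.2): (1/n) * sum_j p_j (1 - p_j) / l_j."""
    return sum(pj * (1.0 - pj) / lj for pj, lj in zip(p, l)) / n


heights = histogram_heights([0.1, 0.2, 0.6, 0.9], [0.5])
uniform_variance = variance_term([0.5, 0.5], [0.5, 0.5], n=4)
```

For the uniform density and two cells of equal length, the variance term equals $(k-1)\|s\|_\infty / n$, so the bound (1.3) is attained in that case.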

Suppose now that we are given a finite (although possibly very large) family $\{\mathcal I_m, m \in \mathcal M\}$ of finite partitions of $\mathcal X$ with respective cardinalities $|\mathcal I_m|$, hence the corresponding families of models $\{S_{\mathcal I_m}, m \in \mathcal M\}$ and histogram estimators $\{\hat s_{\mathcal I_m}, m \in \mathcal M\}$. It is natural to try to find one estimator in the family which leads, at least approximately, to the minimal risk $\inf_{m \in \mathcal M} \mathbb E \big[ \| \hat s_{\mathcal I_m} - s \|^2 \big]$. But one cannot select such an estimator from (1.2) or (1.3) since the risk depends on the unknown density $s$ via $\bar s_{\mathcal I_m}$. Methods of model or estimator selection base the choice of a suitable partition $\mathcal I_{\hat m}$ with $\hat m = \hat m(X_1, \dots, X_n)$ on the observations. When $s \in \mathbb L_\infty(\mu)$ one would like to know whether it is possible to design a selection procedure $\hat m(X_1, \dots, X_n)$ leading (at least approximately), in view of (1.3), to a risk bound of the form

$$\mathbb E \big[ \| \hat s_{\mathcal I_{\hat m}} - s \|^2 \big] \le C \inf_{m \in \mathcal M} \big\{ \| \bar s_{\mathcal I_m} - s \|^2 + n^{-1} \|s\|_\infty |\mathcal I_m| \big\}$$

for some universal constant $C$, even when $\|s\|_\infty$ is unknown.



1.2 What is presently known 

There exists a considerable amount of literature dealing with problems of model or estimator selection. Most of it is actually devoted to the analysis of Gaussian problems, or regression problems, or density estimation with either Hellinger or Kullback loss and it is not our aim here to review this literature. Only a few papers are actually devoted to our subject, namely model or estimator selection for estimating densities with $\mathbb L_2$-loss, and we shall therefore concentrate on these only. These papers can roughly be divided into three groups: the ones dealing with penalized projection estimators, the ones that study aggregation by selection of preliminary estimators and the ones which use methods based on the thresholding of empirical coefficients within a given basis. The last ones are typically not advertised as dealing with model selection but, as explained for instance in Section 5.1.2 of Birgé and Massart (2001), they can be viewed as special instances of model selection methods for models that are spanned by some finite subsets of an orthonormal basis. All these papers have in common the fact that they require more or less severe restrictions on the families of models and, apart from some special cases, typically assume that $s \in \mathbb L_\infty(\mu)$ with a known or estimated bound for $\|s\|_\infty$.

In order to see how such methods apply to our problem of partition selection, let us be more specific and assume that $\mathcal X = [0, 1]$, $\mu$ is the Lebesgue measure and $\mathcal N = \{ j/(N+1), 1 \le j \le N \}$ for some (possibly very large) positive integer $N$. For any subset $m$ of $\mathcal N$, we denote by $\mathcal I_m$ the partition of $\mathcal X$ generated by the intervals with set of endpoints $m \cup \{0, 1\}$ and we set $S_m = S_{\mathcal I_m}$ and $\hat s_m = \hat s_{\mathcal I_m}$. This leads to a set $\mathcal M$ with cardinality $2^N$ and the corresponding families of linear models $\{S_m, m \in \mathcal M\}$ and related histogram estimators $\{\hat s_m, m \in \mathcal M\}$. Then all models $S_m$ are linear subspaces of the largest one $S_{\mathcal N}$. Of particular interest is the dyadic case with $N = 2^K - 1$ for which $S_{\mathcal N}$ is the linear span of the $2^K$ first elements of the Haar basis. There is, nevertheless, a difference between expansions in the Haar basis and projections on our family of models. Let us, for instance, consider the function $1_{[0, 2^{-K})}$. It belongs to the two-dimensional model $S_m$ with $m = \{2^{-K}\}$ but its expansion in the Haar basis has $K$ non-zero coefficients.
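The contrast between the two-dimensional partition model and the Haar expansion can be checked numerically. The sketch below is only an illustration under stated assumptions: it represents functions that are constant on the $2^K$ dyadic cells of $[0,1]$ by their cell values, and the helper names `haar_functions` and `haar_coefficients` are hypothetical. It counts the non-zero wavelet coefficients of $1_{[0, 2^{-K})}$; the coefficient on the constant function is reported separately.

```python
def haar_functions(K):
    """The 2**K orthonormal Haar functions, sampled on the 2**K dyadic cells."""
    n = 2 ** K
    functions = [[1.0] * n]  # constant (scaling) function
    for j in range(K):
        width = n // 2 ** j  # support of psi_{j,k}, measured in cells
        for k in range(2 ** j):
            psi = [0.0] * n
            start = k * width
            for cell in range(start, start + width // 2):
                psi[cell] = 2 ** (j / 2)
            for cell in range(start + width // 2, start + width):
                psi[cell] = -(2 ** (j / 2))
            functions.append(psi)
    return functions


def haar_coefficients(values, K):
    """Inner products in L2([0,1]) of a cellwise-constant function with the Haar functions."""
    cell_length = 2.0 ** -K
    return [cell_length * sum(f * v for f, v in zip(func, values))
            for func in haar_functions(K)]


K = 5
indicator = [1.0] + [0.0] * (2 ** K - 1)  # the function 1_{[0, 2^-K)}
coeffs = haar_coefficients(indicator, K)
constant_coefficient = coeffs[0]
nonzero_wavelet = sum(1 for c in coeffs[1:] if abs(c) > 1e-12)
```

With this convention the indicator has exactly $K$ non-zero wavelet coefficients (one per resolution level), in addition to its coefficient on the constant function, whereas a partition with a single interior endpoint represents it exactly.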

Given a sample $X_1, \dots, X_n$ with unknown density $s$, which partition $\mathcal I_m$ should we choose to estimate $s$ and what bound could we derive for the resulting estimator? Penalized projection estimators have been considered by Birgé and Massart (1997) and an improved version is to be found in Chapter 7 of Massart (2007). The method either deals with polynomial collections of models (which does not apply to our case) or with subset selection within a given basis which applies here only when $N = 2^K - 1$ and we use the Haar basis. Moreover, it requires that $N < n / \log n$ and a bound on $\| \bar s_{\mathcal N} \|_\infty$ be known or estimated, as in Section 4.4.4 of Birgé and Massart (1997), since the penalty depends on it.

Methods based on wavelet thresholding, as described in Donoho, Johnstone, Kerkyacharian and Picard (1996) or Kerkyacharian and Picard (2000) (see also the numerous references therein) require the same type of restrictions and, in particular, a control on $\|s\|_\infty$ in order to properly calibrate the threshold. Also, as mentioned above, restricting to subsets of the Haar basis may result in expansions that use many more coefficients ($K$ instead of 2, for instance) than needed with the partition selection approach.

Aggregation of estimators by selection assumes that preliminary estimators (one for each model in our case) are given in advance (we should here use the histograms) and typically leads to a risk bound including a term of the form $n^{-1} \|s\|_\infty \log |\mathcal M| = n^{-1} N \|s\|_\infty \log 2$ so that all such results are useless for $N \ge n$. Moreover, most of them also require that an upper bound for $\|s\|_\infty$ be known since it enters the construction of the aggregate estimator. This is the case in Rigollet (2006) (see for instance his Corollary 2.7) and Juditsky, Rigollet and Tsybakov (2007, Corollary 5.7) since the parameter $\beta$ that governs their mirror averaging method depends crucially on an upper bound for $\|s\|_\infty$. As to Samarov and Tsybakov (2005), their Assumption 1 requires that $N$ be not larger than $C \log n$. Similar restrictions are to be found in Yang (2000) in his developments for mixing strategies and in Rigollet and Tsybakov (2007) for linear aggregation of estimators. Lounici (2008) does not assume that $s \in \mathbb L_\infty$ but, instead, that all preliminary estimators are uniformly bounded. One can always truncate the estimators to get this, but to be efficient, the truncation should be adapted to the unknown parameter $s$, and therefore chosen from the data in a suitable way. We do not know of any paper that allows such a data-driven choice.

Consequently, none of these results can solve our partition selection problem in a satisfactory way when $N$ is at least of size $n$ and $\|s\|_\infty$ is unknown. This fact was one motivation for our study of model selection for density estimation with $\mathbb L_2$-loss. Results about partition selection will be a consequence of our general treatment of model selection. This treatment allows us to consider arbitrary countable families of finite-dimensional models (possibly nonlinear) and does not put any assumption on the density $s$, apart from the fact that it belongs to $\mathbb L_2(\mu)$; it may, in particular, be unbounded. We do not know of any result that applies to such a situation. There is a counterpart to this level of generality: our procedure is of a purely abstract nature and not constructive, only indicating what is theoretically feasible. Unfortunately, we are unable to design a practical procedure with similar properties.



2 Model based estimation and model selection 

To begin with, let us fix our framework and notations. We want to estimate an unknown density $s$, with respect to some probability measure $\mu$ on the measurable space $(\mathcal X, \mathcal W)$, from an i.i.d. sample $X = (X_1, \dots, X_n)$ of random variables $X_i \in \mathcal X$ with distribution $P_s = s \cdot \mu$. Throughout the paper we denote by $\mathbb P_s$ the probability that gives $X$ the distribution $P_s^{\otimes n}$, by $\mathbb E_s$ the corresponding expectation operator and by $\| \cdot \|_q$ the norm in $\mathbb L_q(\mu)$, omitting the subscript when $q = 2$ for simplicity. We denote by $d_2$ the distance in $\mathbb L_2(\mu)$: $d_2(t, u) = \| t - u \|$. For $1 \le q \le +\infty$ and $\Gamma > 1$, we set

$$\overline{\mathbb L}_q = \Big\{ t \in \mathbb L_q(\mu) \ \Big| \ t \ge 0 \text{ and } \int t \, d\mu = 1 \Big\}; \quad \overline{\mathbb L}_\infty^\Gamma = \big\{ t \in \overline{\mathbb L}_\infty \ \big| \ \|t\|_\infty \le \Gamma \big\}. \tag{2.1}$$

We measure the performance at $s \in \overline{\mathbb L}_2$ of an estimator $\hat s(X) \in \mathbb L_2$ by its quadratic risk $\mathbb E_s \big[ d_2^2(\hat s(X), s) \big]$. More generally, if $(M, d)$ is a metric space of measurable functions on $\mathcal X$ such that $M \cap \overline{\mathbb L}_1 \ne \emptyset$, the quadratic risk of some estimator $\hat s \in M$ at $s \in M \cap \overline{\mathbb L}_1$ is defined as $\mathbb E_s \big[ d^2(\hat s(X), s) \big]$. We denote by $|\mathcal I|$ the cardinality of the set $\mathcal I$ and write $a \vee b$ and $a \wedge b$ for the maximum and the minimum of $a$ and $b$, respectively. Throughout the paper $C$ (or $C'$, ...) will denote a universal (numerical) constant and $C(a, b, \dots)$ or $C_q$ a function of the parameters $a, b, \dots$ or $q$. Both may vary from line to line. Finally, from now on, countable will always mean "finite or countable".



2.1 Model based estimation 

A common method for estimating $s$ consists in choosing a particular subset $S$ of $(M, d)$ that we shall call a model for $s$ and designing an estimator with values in $S$. Of this type are the maximum likelihood estimator over $S$ or the projection estimator onto $S$. Let us set $M = \overline{\mathbb L}_1$ and choose for $d$ either the Hellinger distance $h$ or the variation distance $v$ given respectively by

$$h^2(t, u) = \frac{1}{2} \int \big( \sqrt t - \sqrt u \big)^2 d\mu \quad \text{and} \quad v(t, u) = \frac{1}{2} \int |t - u| \, d\mu.$$

It follows from Le Cam (1973, 1975, 1986) and subsequent results by Birge (1983, 
2006a) that the risk of suitably designed estimators with values in S is the sum of 
two terms, an approximation term depending on the distance from s to S and an 
estimation term depending on the dimension of the model S which can be defined as 
follows. 

Definition 1 Let $S$ be a subset of some metric space $(M, d)$ and let $\mathcal B_d(t, r)$ denote the open ball of center $t$ and radius $r$ with respect to the metric $d$. Given $\eta > 0$, a subset $S_\eta$ of $M$ is called an $\eta$-net for $S$ if, for each $t \in S$, one can find $t' \in S_\eta$ with $d(t, t') \le \eta$.

We say that $S$ has a metric dimension bounded by $D \ge 0$ if, for every $\eta > 0$, there exists an $\eta$-net $S_\eta$ for $S$ such that

$$|S_\eta \cap \mathcal B_d(t, x\eta)| \le \exp \big[ D x^2 \big] \quad \text{for all } x \ge 2 \text{ and } t \in M. \tag{2.2}$$

Remark: One can always assume that $S_\eta \subset S$ at the price of replacing $D$ by $25D/4$ according to Proposition 7 of Birgé (2006a).
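To make condition (2.2) concrete, here is a small numerical check on a toy case that is not taken from the paper: $M = \mathbb R$ with the usual distance, $S = [0, 1]$, and the grid $\{0, \eta, 2\eta, \dots\}$ as $\eta$-net. The function names and the finite sets of radii and centers that are tested are assumptions of this sketch, so it gives evidence for the inequality rather than a proof of it.

```python
import math


def grid_net(eta):
    """Grid {0, eta, 2*eta, ...} inside [0, 1]; every point of [0, 1] is within eta of it."""
    return [k * eta for k in range(int(1 / eta) + 1)]


def count_in_open_ball(net, center, radius):
    """Number of net points in the open ball B(center, radius)."""
    return sum(1 for p in net if abs(p - center) < radius)


def satisfies_condition(eta, D, xs=(2.0, 2.5, 3.0, 4.0, 6.0), n_centers=201):
    """Check |S_eta ∩ B(t, x*eta)| <= exp(D x^2) on a finite set of x and t."""
    net = grid_net(eta)
    centers = [-0.5 + 2.0 * i / (n_centers - 1) for i in range(n_centers)]
    return all(
        count_in_open_ball(net, t, x * eta) <= math.exp(D * x * x)
        for x in xs
        for t in centers
    )


ok_half = satisfies_condition(eta=0.01, D=0.5)
ok_small = satisfies_condition(eta=0.01, D=0.1)
```

A ball of radius $x\eta$ contains at most $2x + 1$ grid points, which stays below $\exp[x^2/2]$ for $x \ge 2$; this is why $D = 1/2$ passes in the toy case while a much smaller value of $D$ fails.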

Typical examples of sets with metric dimension bounded by $D$ when $(M, d)$ is a normed linear space are subsets of $D$-dimensional linear subspaces of $M$, as shown in Birgé (2006a). If $d$ is either $h$ or $v$ and $S \subset \overline{\mathbb L}_1$ has a metric dimension bounded by $D \ge 1/2$, there exists a universal constant $C$ and an estimator $\hat s(X)$ with values in $S$ such that, for any $s \in \overline{\mathbb L}_1$,

$$\mathbb E_s \big[ d^2(\hat s(X), s) \big] \le C \Big[ \inf_{t \in S} d^2(s, t) + n^{-1} D \Big]. \tag{2.3}$$

In particular, $\sup_{s \in S} \mathbb E_s \big[ d^2(\hat s(X), s) \big] \le C n^{-1} D$. This results from the following theorem about model selection of Birgé (2006a) by setting $\mathcal M = \{0\}$, $S_0 = S$, $D_0 = D$ and $\Delta_0 = 1/2$.

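Theorem 1 Let $X_1, \dots, X_n$ be an i.i.d. sample with unknown density $s$ belonging to $\overline{\mathbb L}_1$ and $\{S_m, m \in \mathcal M\}$ a finite or countable family of subsets of $\overline{\mathbb L}_1$ with metric dimensions bounded by $D_m \ge 1/2$ respectively. Let the nonnegative weights $\Delta_m$, $m \in \mathcal M$, satisfy

$$\sum_{m \in \mathcal M} \exp[-\Delta_m] = \Sigma < +\infty. \tag{2.4}$$

Then there exists a universal constant $C$ and an estimator $\hat s(X_1, \dots, X_n)$ such that, for any $s \in \overline{\mathbb L}_1$,

$$\mathbb E_s \big[ d^2(\hat s, s) \big] \le C (1 + \Sigma) \inf_{m \in \mathcal M} \Big\{ \inf_{t \in S_m} d^2(s, t) + n^{-1} (D_m \vee \Delta_m) \Big\}. \tag{2.5}$$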
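The role of the weights in (2.4) can be illustrated on the partition selection problem of Section 1.2, where the models are indexed by the subsets $m$ of a set of $N$ points. The choice $\Delta_m = \log \binom{N}{|m|} + |m|$ used below is a common one but is an assumption of this sketch, not a prescription of the paper; the function names are hypothetical as well.

```python
import math


def weights_by_cardinality(N):
    """Hypothetical weights: Delta_m = log C(N, |m|) + |m|, indexed by k = |m|."""
    return {k: math.log(math.comb(N, k)) + k for k in range(N + 1)}


def sigma(N):
    """Sum over all 2**N subsets m of exp(-Delta_m), grouped by cardinality."""
    deltas = weights_by_cardinality(N)
    return sum(math.comb(N, k) * math.exp(-deltas[k]) for k in range(N + 1))


def complexity_term(N, k, n):
    """n^{-1} (D_m v Delta_m) for a subset of size k, taking D_m = k + 1 as the model dimension."""
    delta = weights_by_cardinality(N)[k]
    return max(k + 1, delta) / n


total = sigma(50)
```

With this choice the sum in (2.4) reduces to $\sum_{k=0}^N e^{-k} \le e/(e-1)$ whatever $N$, so the price of the $2^N$ models appears only through $\Delta_m$ in the term $D_m \vee \Delta_m$.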
Unfortunately, as we shall see below, (2.3) does not hold in general when $(M, d) = (\mathbb L_2(\mu), d_2)$. In particular, whatever the estimator $\hat s$, $\sup_{s \in S} \mathbb E_s \big[ d_2^2(\hat s(X), s) \big]$ may be infinite even if $S \subset \overline{\mathbb L}_2(\mu)$ has a bounded metric dimension. This difference is due to the following fact: $h$ and $v$ are actually distances defined on the set of all probabilities on $(\mathcal X, \mathcal W)$ and $h(s, t) = h(P_s, P_t)$ is independent of the choice of the underlying dominating measure, the same property holding for the variation distance $v$. This is not the case for $d_2$ which is a distance in $\mathbb L_2(\mu)$ depending on the choice of $\mu$ and definitely not a distance between probabilities. Even the fact that $s = dP_s / d\mu$ belongs or not to $\mathbb L_2(\mu)$ depends on $\mu$. Further remarks on this subject can be found in Devroye and Györfi (1985) and Devroye (1987).

Nevertheless, the $\mathbb L_2$-distance has been much more popular in the past than either the Hellinger or variation distances, mainly because of its simplicity due to the classical "squared bias plus variance" decomposition of the risk. Although hundreds of papers have been devoted to the derivation of risk bounds for various specific estimators, we do not know of any general bound for the risk similar to (2.3) based on purely metric considerations for the distance $d_2$.



2.2 Projection and histogram estimators 



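To illustrate the specificity of the $\mathbb L_2$-risk, let us turn to a quite classical family of model-based estimators for densities, the projection estimators of Cencov (1962). To estimate a density $s \in \overline{\mathbb L}_2$ from an i.i.d. sample $X_1, \dots, X_n$, we choose some $k$-dimensional linear subspace $S$ of $\mathbb L_2(\mu)$ together with an orthonormal basis $(\varphi_1, \dots, \varphi_k)$ so that the projection $\bar s$ of $s$ onto $S$ can be written $\bar s = \sum_{j=1}^k \beta_j \varphi_j$. Then we estimate each coefficient $\beta_j = \int \varphi_j s \, d\mu$ in this expansion by its empirical version $\hat\beta_j = n^{-1} \sum_{i=1}^n \varphi_j(X_i)$. This results in the projection estimator $\hat s = \sum_{j=1}^k \hat\beta_j \varphi_j$ (which in general does not belong to $\overline{\mathbb L}_1$) with risk

$$\mathbb E_s \big[ \| \hat s - s \|^2 \big] = \| \bar s - s \|^2 + n^{-1} \sum_{j=1}^k \mathrm{Var}_s \big( \varphi_j(X_1) \big) \le \| \bar s - s \|^2 + n^{-1} \int \Big[ \sum_{j=1}^k \varphi_j^2(x) \Big] s(x) \, d\mu(x)$$

$$\le \| \bar s - s \|^2 + n^{-1} \min \Big\{ \Big\| \sum_{j=1}^k \varphi_j^2 \Big\|_\infty ; \ k \|s\|_\infty \Big\}. \tag{2.6}$$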



A particular case occurs with the histogram $\hat s_{\mathcal I}$ given by (1.1) which corresponds to choosing $\varphi_j = l_j^{-1/2} 1_{I_j}$, $S = S_{\mathcal I}$ and $\bar s = \bar s_{\mathcal I}$. If $l_j = k^{-1}$ for all $j$, we get a regular histogram and derive from (1.2) and a convexity argument that

$$\mathbb E_s \big[ \| \hat s_{\mathcal I} - s \|^2 \big] \le \| \bar s_{\mathcal I} - s \|^2 + (k-1)/n.$$

But, for general partitions, the bound (1.3) clearly emphasizes the difference with the risk bound of the form (2.3) obtained in Birgé and Rozenholc (2006) for the Hellinger loss:

$$\mathbb E_s \big[ h^2(s, \hat s_{\mathcal I}) \big] \le h^2(s, \bar s_{\mathcal I}) + (k-1)/(2n).$$
Moreover, (1.3) is essentially unimprovable without further assumptions on $s$ if the partition $\mathcal I$ is arbitrary, as shown by the following example. Define the partition $\mathcal I$ on $\mathcal X = [0, 1]$ by $I_j = [(j-1)\alpha, j\alpha)$ for $1 \le j < k$ and $I_k = [(k-1)\alpha, 1]$ with $0 < \alpha < (k-1)^{-1}$. Set $s = \bar s_{\mathcal I} = [(k-1)\alpha]^{-1} [1 - 1_{I_k}]$. Then $p_j = (k-1)^{-1}$ for $1 \le j < k$, $s = \bar s_{\mathcal I}$ and it follows from (1.2) that

$$\mathbb E_s \big[ \| \hat s_{\mathcal I} - s \|^2 \big] = \frac{k-2}{(k-1)\alpha n} = \frac{(k-2) \|s\|_\infty}{n},$$

which shows that there is little space for improvement in (1.3).
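The arithmetic of this example can be replayed directly from (1.2) and (1.3). The sketch below is a plain numerical check; the function name `tightness_example` and the particular values of $k$, $\alpha$ and $n$ are chosen for illustration only.

```python
def tightness_example(k, alpha, n):
    """Exact risk from (1.2) and upper bound from (1.3) for the example partition.

    Cells I_1..I_{k-1} have length alpha, I_k has the remaining length, and
    s is constant on I_1 ∪ ... ∪ I_{k-1} and vanishes on I_k.
    """
    assert k >= 3 and 0 < alpha < 1 / (k - 1)
    lengths = [alpha] * (k - 1) + [1 - (k - 1) * alpha]
    probs = [1 / (k - 1)] * (k - 1) + [0.0]
    sup_norm = 1 / ((k - 1) * alpha)
    # s lies in the model, so the bias term of (1.2) is zero
    risk = sum(p * (1 - p) / l for p, l in zip(probs, lengths)) / n
    bound = (k - 1) * sup_norm / n
    return risk, bound


risk, bound = tightness_example(k=11, alpha=0.01, n=100)
ratio = risk / bound
```

The ratio between the exact risk and the bound is $(k-2)/(k-1)$, which tends to one as $k$ grows.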
2.3 Some negative results 

The fact that the $\mathbb L_\infty$-norm of $s$ comes into the risk is not due to the use of histograms or projection estimators as shown by another negative result provided by Proposition 4 of Birgé (2006b) that we recall below for the sake of completeness.

Proposition 1 For each $L > 0$ and each integer $D$ with $1 \le D \le 3n$, one can find a finite set $S$ of densities with the following properties:

i) it is a subset of some $D$-dimensional affine subspace of $\mathbb L_2([0, 1], dx)$ with a metric dimension bounded by $D/2$;

ii) $\sup_{s \in S} \|s\|_\infty \le L + 1$;

iii) for any estimator $\hat s(X_1, \dots, X_n)$ belonging to $\mathbb L_2([0, 1], dx)$ and based on an i.i.d. sample with density $s \in S$,

$$\sup_{s \in S} \mathbb E_s \big[ \| \hat s - s \|^2 \big] > 0.0139 \, D L n^{-1}. \tag{2.7}$$

It follows that there is no hope to get an analogue of (2.3), under the same assumptions, when $d = d_2$ and the best one can expect in full generality, when $S$ is a model with metric dimension bounded by $D$ and $s \in \overline{\mathbb L}_\infty$, is to design an estimator $\hat s$ with a risk bounded by

$$\mathbb E_s \big[ d_2^2(\hat s, s) \big] \le C \Big[ \inf_{t \in S} d_2^2(s, t) + n^{-1} D \|s\|_\infty \Big]. \tag{2.8}$$

The situation becomes worse when $s \notin \mathbb L_\infty(\mu)$ or if $\sup_{s \in S} \|s\|_\infty = +\infty$ as shown by the following lower bound to be proved in Section 7.1.

Proposition 2 Let $S = \{ s_\theta, 0 < \theta \le 1/3 \}$ be the set of densities with respect to Lebesgue measure on $[0, 1]$ given by

$$s_\theta = \theta^{-2} 1_{[0, \theta^3]} + (\theta^2 + \theta + 1)^{-1} 1_{(\theta^3, 1]}.$$

If we have at disposal $n$ i.i.d. observations with density $s_\theta \in S$, we can build an estimator $\hat s_n$ such that $\sup_{0 < \theta \le 1/3} \mathbb E_{s_\theta} \big[ n h^2(s_\theta, \hat s_n) \big] \le C$ for some $C$ independent of $n$. On the other hand, although the metric dimension of $S$ with respect to the distance $d_2$ is bounded by 2, $\sup_{0 < \theta \le 1/3} \mathbb E_{s_\theta} \big[ \| s_\theta - \hat s_n \|^2 \big] = +\infty$, whatever $n$ and the estimator $\hat s_n$.
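A quick computation shows why the family of Proposition 2 is delicate for the $\mathbb L_2$-loss. The sketch below assumes that $s_\theta$ equals $\theta^{-2}$ on $[0, \theta^3]$ and $(\theta^2 + \theta + 1)^{-1}$ on the rest of $[0, 1]$, which is the reading under which $s_\theta$ integrates to one; the function names are hypothetical.

```python
def s_theta_mass(theta):
    """Integral of s_theta over [0, 1] under the assumed form of the density."""
    spike = theta ** -2 * theta ** 3                       # height * length of [0, theta^3]
    flat = (1 - theta ** 3) / (theta ** 2 + theta + 1)     # height * length of (theta^3, 1]
    return spike + flat


def s_theta_squared_l2_norm(theta):
    """Squared L2([0,1], dx) norm of s_theta under the same assumed form."""
    spike = theta ** -4 * theta ** 3
    flat = (1 - theta ** 3) / (theta ** 2 + theta + 1) ** 2
    return spike + flat


masses = [s_theta_mass(t) for t in (1 / 3, 0.1, 0.01)]
norms = [s_theta_squared_l2_norm(t) for t in (0.1, 0.01, 0.001)]
```

The mass is always one because $1 - \theta^3 = (1 - \theta)(\theta^2 + \theta + 1)$, while the squared norm behaves like $\theta^{-1}$ and is therefore unbounded over the family as $\theta \to 0$.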
2.4 About this paper

Our purpose here is twofold. We first want to derive estimators achieving a risk bound which generalizes (2.8) in the sense that it could also apply to the case of $s \notin \mathbb L_\infty$. We know from (2.6) that projection estimators do satisfy (2.8) when $S$ is a $D$-dimensional linear space, but do not have any result for non-linear models. Our second goal is to handle many models simultaneously and design an estimator which performs as well (or almost as well) as the estimator based on the "best model", i.e. one leading to the smallest risk bound (up to some universal constant). This is possible when the distance $d$ is either the Hellinger distance $h$ or the variation distance $v$, as shown by Theorem 1. As compared to the bound (2.3), we simply pay the price of replacing $D_m$ by $\Delta_m$ when $\Delta_m > D_m$. This price is due to the complexity of the family of models we use (there is nothing to pay in the simplest case of one model per dimension) and this price is essentially unavoidable, as shown in a specific case by Birgé and Massart (2006).

It follows from the previous section that it is impossible to get an analogue of Theorem 1 when $d = d_2$. We shall explain what kind of (necessarily weaker) results can be obtained in this context and to what extent Theorem 1 can be rescued. For this, we shall proceed in several steps. In the next section we shall explain how to build estimators based on families of special models $S_m$, following the method explained in Birgé (2006a). These models need to be discrete subsets of $\overline{\mathbb L}_\infty^\Gamma$ (for some given $\Gamma$) with bounded metric dimension while there is no reason that our initial models $S_m$ be of this type (think of linear models). Section 4 will therefore be devoted to the construction of such special models $S_m$ from ordinary ones. This construction will finally lead to an estimator belonging to $\overline{\mathbb L}_\infty^\Gamma$, the performance of which strongly depends on our choice of $\Gamma$. In Section 5 we shall explain how, given a geometrically increasing sequence $(\Gamma_j)_{j \ge 1}$ of values of $\Gamma$ and the corresponding sequence of estimators $\hat s^{\Gamma_j}$, we can use the observations to choose a suitable value for $\Gamma$. Since we have a single sample $X$ to build the estimators $\hat s^{\Gamma_j}$ and to choose $\Gamma$, we shall proceed by sample splitting using one half of the sample for the construction of the estimators and the second half to select a value of $\Gamma$. In particular, for the case of a single model, this will lead to a generalized version of (2.8) that can also handle the case of $s \notin \mathbb L_\infty$. When $s \in \mathbb L_\infty$ (with an unknown value of $\|s\|_\infty$), the risk bounds we get completely parallel (apart from some constants depending on $\|s\|_\infty$) those obtained for estimating $s$ in the white noise model. We shall give a few applications of these results, in particular to aggregation of preliminary estimators, in Section 6, while the last section will be devoted to the most technical proofs.

3 T-estimators for L2-loss

In order to define estimators based on families of models with bounded metric dimensions, we shall follow the approach of Birgé (2006a) based on what we have called T-estimators. We refer to this paper for the definition of these estimators, recalling that it relies on the existence of suitable tests between balls of the underlying metric space $(\mathbb L_2, d_2)$. To derive such tests, we need a few specific technical tools to deal with the $\mathbb L_2$-distance.
3.1 Tests between L2-balls

3.1.1 Randomizing our sample

In the sequel we shall make use of randomized tests based on a randomization trick due to Yang and Barron (1998, page 106) which has the effect of replacing all densities involved in our problem by new ones which are uniformly bounded away from zero. For this, we choose some number $\lambda \in (0, 1)$ and consider the mapping $\tau$ from $\mathbb L_2$ to $\mathbb L_2$ given by $\tau(u) = \lambda u + 1 - \lambda$. Note that $\tau$ is one-to-one and isometric, up to a factor $\lambda$, i.e. $d_2(\tau(u), \tau(v)) = \lambda d_2(u, v)$. If $u \in \overline{\mathbb L}_\infty^\Gamma$, then $\tau(u) \in \overline{\mathbb L}_\infty^{\Gamma'}$ with $\Gamma' = \lambda \Gamma + 1 - \lambda$.

Let $s' = \tau(s)$. Given our initial i.i.d. sample $X$, we want to build new i.i.d. variables $X'_1, \dots, X'_n$ with density $s'$. For this, we consider two independent $n$-samples, $Z_1, \dots, Z_n$ and $\varepsilon_1, \dots, \varepsilon_n$ with respective distributions $\mu$ and Bernoulli with parameter $\lambda$. Both samples are independent of $X$. We then set $X'_i = \varepsilon_i X_i + (1 - \varepsilon_i) Z_i$ for $1 \le i \le n$. It follows that $X'_i$ has density $s'$ as required. We shall still denote by $\mathbb P_s$ the probability that gives $X' = (X'_1, \dots, X'_n)$ the distribution $P_{s'}^{\otimes n}$. Given two distinct points $t, u \in \mathbb L_2$ we define a (randomized) test function $\psi(X')$ between $t$ and $u$ as a measurable function with values in $\{t, u\}$, $\psi(X') = t$ meaning deciding $t$ and $\psi(X') = u$ meaning deciding $u$.

Once we have used the randomization trick of Yang and Barron, for instance with $\lambda = 1/2$, we deal with an i.i.d. sample $X'$ with a density $s'$ which is bounded from below by $1/2$ and we may therefore work within the set of densities that satisfy this property.
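The construction $X'_i = \varepsilon_i X_i + (1 - \varepsilon_i) Z_i$ is easy to simulate. The following sketch assumes $\mathcal X = [0, 1]$ with $\mu$ the uniform distribution, so that the $Z_i$ are uniform draws; the function name `randomize_sample` and the use of a seeded `random.Random` instance are choices made for the example.

```python
import random


def randomize_sample(sample, lam, rng):
    """Return X'_i = eps_i * X_i + (1 - eps_i) * Z_i with eps_i ~ Bernoulli(lam), Z_i ~ mu.

    Here mu is assumed to be the uniform distribution on [0, 1].
    """
    randomized = []
    for x in sample:
        z = rng.random()            # Z_i drawn from mu
        eps = rng.random() < lam    # eps_i = 1 with probability lam
        randomized.append(x if eps else z)
    return randomized


rng = random.Random(0)
original = [0.05, 0.10, 0.12, 0.90]
mixed = randomize_sample(original, lam=0.5, rng=rng)
```

Each $X'_i$ is then a draw from the mixture $\lambda s + (1 - \lambda)$, whose density is bounded from below by $1 - \lambda$ whatever $s$.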



3.1.2 Some minimax results 

The main tool for the design of tests between $\mathbb L_2$-balls of densities is the following proposition which derives from the results of Birgé (1984) (keeping here the notations of that paper) and in particular from Corollary 3.2, specialized to the case of $I = \{t\}$ and $c = 0$.

Proposition 3 Let $\mathbf M$ be some linear space of finite measures on some measurable space $(\Omega, \mathcal A)$ with a topology of a locally convex separated linear space. Let $\mathcal P$, $\mathcal Q$ be two disjoint sets of probabilities in $\mathbf M$ and $\mathcal F$ a set of positive measurable functions on $\Omega$ with the following properties (with respect to the given topology on $\mathbf M$):

i) $\mathcal P$ and $\mathcal Q$ are convex and compact;

ii) for any $f \in \mathcal F$ and $0 < z < 1$ the function $P \mapsto \int f^z \, dP$ is well-defined and upper semi-continuous on $\mathcal P \cup \mathcal Q$;

iii) for any $P \in \mathcal P$, $Q \in \mathcal Q$, $t \in (0, 1)$ and $\varepsilon > 0$, there exists an $f \in \mathcal F$ such that

$$(1 - t) \int f^t \, dP + t \int f^{t-1} \, dQ < \int (dP)^{1-t} (dQ)^t + \varepsilon;$$

iv) all probabilities in $\mathcal P$ (respectively in $\mathcal Q$) are mutually absolutely continuous.

Then one can find $\bar P \in \mathcal P$ and $\bar Q \in \mathcal Q$ such that

$$\sup_{P \in \mathcal P} \int \Big( \frac{d\bar Q}{d\bar P} \Big)^t dP = \sup_{Q \in \mathcal Q} \int \Big( \frac{d\bar P}{d\bar Q} \Big)^{1-t} dQ = \sup_{P \in \mathcal P, Q \in \mathcal Q} \int (dP)^{1-t} (dQ)^t = \int (d\bar P)^{1-t} (d\bar Q)^t.$$
In Birgé (1984) we assumed that $\mathbf M$ was the set of all finite measures on $(\Omega, \mathcal A)$ but the proof actually only uses the fact that $\mathcal P$ and $\mathcal Q$ are subsets of $\mathbf M$. Recalling that the Hellinger affinity between two densities $u$ and $v$ is defined by $\rho(u, v) = \int \sqrt{uv} \, d\mu = 1 - h^2(u, v)$, we get the following corollary.

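Corollary 1 Let $\mu$ be a probability measure on $(\mathcal X, \mathcal W)$ and, for $1 \le i \le n$, let $(\mathcal P_i, \mathcal Q_i)$ be a pair of disjoint convex and weakly compact subsets of $\mathbb L_2(\mu)$ such that

$$\int s \, d\mu = 1 \quad \text{for all } s \in \bigcup_{i=1}^n (\mathcal P_i \cup \mathcal Q_i). \tag{3.1}$$

For each $i$, one can find $p_i \in \mathcal P_i$ and $q_i \in \mathcal Q_i$ such that

$$\sup_{u \in \mathcal P_i} \int \sqrt{q_i / p_i} \, u \, d\mu = \sup_{v \in \mathcal Q_i} \int \sqrt{p_i / q_i} \, v \, d\mu = \sup_{u \in \mathcal P_i, v \in \mathcal Q_i} \rho(u, v) = \rho(p_i, q_i).$$

Let $X = (X_1, \dots, X_n)$ be a random vector on $\mathcal X^n$ with distribution $\bigotimes_{i=1}^n (s_i \cdot \mu)$ with $s_i \in \mathcal P_i$ for $1 \le i \le n$ and let $x \in \mathbb R$. Then

$$\mathbb P \Big[ \sum_{i=1}^n \log(q_i / p_i)(X_i) \ge 2x \Big] \le e^{-x} \prod_{i=1}^n \rho(p_i, q_i) \le \exp \Big[ -x - \sum_{i=1}^n h^2(p_i, q_i) \Big].$$

If $X$ has distribution $\bigotimes_{i=1}^n (u_i \cdot \mu)$ with $u_i \in \mathcal Q_i$ for $1 \le i \le n$, then

$$\mathbb P \Big[ \sum_{i=1}^n \log(q_i / p_i)(X_i) \le 2x \Big] \le e^{x} \prod_{i=1}^n \rho(p_i, q_i) \le \exp \Big[ x - \sum_{i=1}^n h^2(p_i, q_i) \Big].$$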

Proof: We apply the previous proposition with $t = 1/2$, $(\mathcal X, \mathcal W) = (\Omega, \mathcal A)$ and $\mathbf M$ the set of measures of the form $u \cdot \mu$, $u \in \mathbb L_2(\mu)$, endowed with the weak $\mathbb L_2$-topology. In view of (3.1), $\mathcal P_i$ and $\mathcal Q_i$ can be identified with two sets of probabilities and we can take for $\mathcal F$ the set of all positive functions $f$ such that $\log f$ is bounded. As a consequence, all four assumptions of Proposition 3 are satisfied. In order to get iii) we simply take for $f$ a suitably truncated version of $s/u$ when $P = s \cdot \mu$ and $Q = u \cdot \mu$. As to the probability bounds, they derive from classical exponential inequalities, as for Lemma 7 of Birgé (2006a).



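3.1.3 Abstract tests between L2-balls

The purpose of this section is to prove the following result.

Theorem 2 Let $t, u \in \overline{\mathbb L}_\infty^\Gamma$ for some $\Gamma < +\infty$. For any $x \in \mathbb R$, there exists a test $\psi_{t,u,x}$ between $t$ and $u$, based on the randomized sample $X'$ defined in Section 3.1.1 with a suitable value of $\lambda$, which satisfies

$$\sup_{\{ s \in \overline{\mathbb L}_2 \, | \, d_2(s, t) \le d_2(t, u)/4 \}} \mathbb P_s \big[ \psi_{t,u,x}(X') = u \big] \le \exp \Big[ - \frac{n \big( \|t - u\|^2 + x \big)}{65 \Gamma} \Big], \tag{3.2}$$

$$\sup_{\{ s \in \overline{\mathbb L}_2 \, | \, d_2(s, u) \le d_2(t, u)/4 \}} \mathbb P_s \big[ \psi_{t,u,x}(X') = t \big] \le \exp \Big[ - \frac{n \big( \|t - u\|^2 - x \big)}{65 \Gamma} \Big]. \tag{3.3}$$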
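Proof: It requires several steps. To begin with, we use the randomization trick of Yang and Barron described in Section 3.1.1, replacing our original sample $X$ by the randomized sample $X' = (X'_1, \dots, X'_n)$ for some convenient value of $\lambda$ to be chosen later. Each $X'_i$ has density $s' \ge 1 - \lambda$ when $X_i$ has density $s$. Then we build a test between $t' = \tau(t)$ and $u' = \tau(u)$ based on $X'$ and Corollary 1. To do this, we set $\Delta = \|t - u\|$,

$$\mathcal P = \tau \big( \mathcal B_{d_2}(t, \Delta/4) \cap \overline{\mathbb L}_2 \big) \quad \text{and} \quad \mathcal Q = \tau \big( \mathcal B_{d_2}(u, \Delta/4) \cap \overline{\mathbb L}_2 \big).$$

Then $\mathcal P$ is the subset of the ball $\mathcal B_{d_2}(t', \lambda\Delta/4)$ of those densities bounded from below by $1 - \lambda$, hence $d_2(\mathcal P, \mathcal Q) \ge \lambda\Delta/2$ and $\mathcal P$ is convex and weakly closed since any indicator function belongs to $\mathbb L_2(\mu)$ because $\mu$ is a probability. Since $\mathcal B_{d_2}(t', \lambda\Delta/4)$ is weakly compact, it is also the case for $\mathcal P$ and the same argument shows that $\mathcal Q$ is also convex and weakly compact. It then follows from Corollary 1 that one can find $\bar t \in \mathcal P$ and $\bar u \in \mathcal Q$ such that

$$\mathbb P_s \Big[ \sum_{i=1}^n \log \big( \bar u(X'_i) / \bar t(X'_i) \big) \ge 2y \Big] \le \exp \big[ -n h^2(\bar t, \bar u) - y \big] \quad \text{if } s' \in \mathcal P, \tag{3.4}$$

while

$$\mathbb P_s \Big[ \sum_{i=1}^n \log \big( \bar u(X'_i) / \bar t(X'_i) \big) \le 2y \Big] \le \exp \big[ -n h^2(\bar t, \bar u) + y \big] \quad \text{if } s' \in \mathcal Q. \tag{3.5}$$

Fixing $y = nx/(65\Gamma)$, we finally define $\psi_{t,u,x}(X')$ by setting $\psi_{t,u,x}(X') = u$ if and only if $\sum_{i=1}^n \log \big( \bar u(X'_i) / \bar t(X'_i) \big) \ge 2y$. Since $s' \in \mathcal P$ is equivalent to $s \in \mathcal B_{d_2}(t, \Delta/4)$, i.e. $d_2(s, t) \le \Delta/4$, and similarly $s' \in \mathcal Q$ is equivalent to $d_2(s, u) \le \Delta/4$, to derive (3.2) and (3.3) from (3.4) and (3.5), we just have to show that $h^2(\bar t, \bar u) \ge (65\Gamma)^{-1} \Delta^2$. We start from the fact, to be proved below, that

$$\| \bar t \vee \bar u \|_\infty \le 2(\lambda\Gamma + 1 - \lambda). \tag{3.6}$$

It implies that

$$h^2(\bar t, \bar u) = \frac{1}{2} \int \big( \sqrt{\bar t} - \sqrt{\bar u} \big)^2 d\mu = \frac{1}{2} \int \frac{(\bar t - \bar u)^2}{\big( \sqrt{\bar t} + \sqrt{\bar u} \big)^2} \, d\mu \ge \frac{\| \bar t - \bar u \|^2}{16(\lambda\Gamma + 1 - \lambda)} \ge \frac{(\lambda\Delta)^2}{64(\lambda\Gamma + 1 - \lambda)}.$$

Choosing $\lambda$ close enough to one leads to the required bound $h^2(\bar t, \bar u) \ge (65\Gamma)^{-1} \Delta^2$. As to (3.6), it is a consequence of the next lemma to be proved in Section 7.2. We apply this lemma to the pair $t', u'$ which satisfies $\| t' \vee u' \|_\infty \le \lambda\Gamma + 1 - \lambda$. If (3.6) were wrong, we could find $\bar t' \in \mathcal P$ and $\bar u' \in \mathcal Q$ with $h(\bar t', \bar u') < h(\bar t, \bar u)$, which, by Corollary 1, is impossible. □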
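The step "choosing $\lambda$ close enough to one" can be made quantitative. The sketch below checks the inequality $\lambda^2 / [64(\lambda\Gamma + 1 - \lambda)] \ge 1/(65\Gamma)$, which is what the last display needs, and solves the corresponding quadratic for the smallest admissible $\lambda$. The function names are hypothetical and the computation is only a sanity check of the constants 64 and 65.

```python
import math


def hellinger_bound_holds(lam, gamma_cap):
    """True when lam^2 / (64 (lam*Gamma + 1 - lam)) >= 1 / (65 Gamma)."""
    return lam ** 2 / (64 * (lam * gamma_cap + 1 - lam)) >= 1 / (65 * gamma_cap)


def minimal_lambda(gamma_cap):
    """Positive root of 65*Gamma*lam^2 - 64*(Gamma - 1)*lam - 64 = 0."""
    a = 65 * gamma_cap
    b = -64 * (gamma_cap - 1)
    c = -64.0
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


thresholds = {g: minimal_lambda(g) for g in (1.0, 2.0, 10.0, 100.0)}
```

At $\lambda = 1$ the quadratic equals $\Gamma > 0$, so the threshold is always strictly smaller than one and a suitable $\lambda \in (0, 1)$ exists for every $\Gamma \ge 1$.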

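Lemma 1 Let us consider four elements $t, u, v_1, v_2$ in $\overline{\mathbb L}_2$ with $t \ne u$, $v_1 \ne v_2$ and $\| t \vee u \|_\infty = B$. If $\| v_1 \vee v_2 \|_\infty > 2B$, there exist $v'_1, v'_2 \in \overline{\mathbb L}_2$ with $d_2(v'_1, t) \le d_2(v_1, t)$, $d_2(v'_2, u) \le d_2(v_2, u)$ and $h(v'_1, v'_2) < h(v_1, v_2)$.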
3.2 The performance of T-estimators

We are now in a position to prove an analogue of Corollary 6 of Birgé (2006a).

Theorem 3 Assume that we observe $n$ i.i.d. random variables with unknown density $s \in (\overline{\mathbb L}_2, d_2)$ and that we have at disposal a countable family of discrete subsets $\{S_m\}_{m \in \mathcal M}$ of $\overline{\mathbb L}_\infty^\Gamma$ for some given $\Gamma > 1$. Let each set $S_m$ satisfy

$$|S_m \cap \mathcal B_{d_2}(t, x\eta_m)| \le \exp \big[ D_m x^2 \big] \quad \text{for all } x \ge 2 \text{ and } t \in \mathbb L_2, \tag{3.7}$$

with $\eta_m > 0$, $D_m \ge 1/2$,

$$\eta_m^2 \ge \frac{273 \, \Gamma D_m}{n} \quad \text{for all } m \in \mathcal M, \qquad \sum_{m \in \mathcal M} \exp \Big[ - \frac{n \eta_m^2}{1365 \, \Gamma} \Big] \le \Sigma < +\infty. \tag{3.8}$$

Then one can build a T-estimator $\hat s$ such that, for all $s \in \overline{\mathbb L}_2$,

$$\mathbb E_s \big[ d_2^q(\hat s, s) \big] \le C_q (\Sigma + 1) \inf_{m \in \mathcal M} \big\{ d_2(s, S_m) \vee \eta_m \big\}^q, \quad \text{for all } q \ge 1. \tag{3.9}$$

Proof: Since (3.9) is merely a version of (7.6) of Birgé (2006a) with $d = d_2$, we just have to show that Theorem 5 of this paper applies to our situation. It relies on Assumptions 1 and 3 of the paper. Assumption 3 follows from (3.7). As to Assumption 1 (with $a = n/(65\Gamma)$, $B = B' = 1$ and $\delta = 4 d_2$, hence $\kappa = 4$), it is a consequence of our Theorem 2. The conditions (7.2) and (7.4) of Birgé (2006a) on $\eta_m$ and $D_m$ follow from (3.8). □

In the case of a single $D$-dimensional model $S \subset \overline{\mathbb L}_\infty^\Gamma$ we get the following corollary:

Corollary 2 Assume that we observe $n$ i.i.d. random variables with unknown distribution $P_s$, $s \in (\overline{\mathbb L}_2, d_2)$ and that we have at disposal a $D$-dimensional model $S \subset \overline{\mathbb L}_\infty^\Gamma$ for some given $\Gamma > 1$. One can build a T-estimator $\hat s$ such that, for all $s \in \overline{\mathbb L}_2$,

$$\mathbb E_s \big[ \| \hat s - s \|^2 \big] \le C \Big[ \inf_{t \in S} d_2^2(s, t) + n^{-1} D \Gamma \Big].$$

Proof: By Definition 1 and the remark following it, for each $\eta_0 > 0$, one can find an $\eta_0$-net $S_0 \subset S$ for $S$, hence $S_0 \subset \overline{\mathbb L}_\infty^\Gamma$, satisfying (3.7) with $D_0 = 25D/4$. Moreover $d_2(s, S_0) \le \eta_0 + d_2(s, S)$. Choosing $\eta_0^2 = 273 \times 25 \, \Gamma D / (4n)$, we may apply Theorem 3. The result then follows from (3.9) with $q = 2$. □

Theorem 3 applies in particular to the special situation of each model $S_m$ being reduced to a single point $\{t_m\}$ so that we can take $D_m = 1/2$ for each $m$. We then get the following useful corollary.

Corollary 3 Assume that we observe $n$ i.i.d. random variables with unknown distribution $P_s$, $s \in (\overline{\mathbb L}_2, d_2)$ and that we have at disposal a countable subset $S = \{t_m\}_{m \in \mathcal M}$ of $\overline{\mathbb L}_\infty^\Gamma$ for some given $\Gamma > 1$. Let $\{\Delta_m\}_{m \in \mathcal M}$ be a family of weights such that $\Delta_m \ge 1/10$ for all $m \in \mathcal M$ and

$$1 \le \sum_{m \in \mathcal M} \exp[-\Delta_m] = \Sigma < +\infty. \tag{3.10}$$

We can build a T-estimator $\hat s$ such that, for all $s \in \overline{\mathbb L}_2$,

$$\mathbb E_s \big[ d_2^q(\hat s, s) \big] \le C_q \Sigma \inf_{m \in \mathcal M} \Big\{ d_2(s, t_m) \vee \sqrt{\Gamma \Delta_m / n} \Big\}^q \quad \text{for all } q \ge 1.$$

Proof: Let us set here $S_m = \{t_m\}$, $D_m = 1/2$ and $\eta_m = 37 \sqrt{\Gamma \Delta_m / n}$ for $m \in \mathcal M$. One can then check that (3.7) and (3.8) are satisfied so that (3.9) holds. Our risk bound follows. □
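The verification left to the reader in this proof is short arithmetic, which the following sketch spells out. It assumes the reading $\eta_m = 37 \sqrt{\Gamma \Delta_m / n}$ and the form of condition (3.8) with constants 273 and 1365; the function names and the sample weights are hypothetical.

```python
import math


def eta_squared(delta_m, gamma_cap, n):
    """eta_m^2 for the assumed choice eta_m = 37 * sqrt(Gamma * Delta_m / n)."""
    return 37 ** 2 * gamma_cap * delta_m / n


def check_conditions(deltas, gamma_cap, n, d_m=0.5):
    """Return (first part of (3.8) holds, left-hand sum of (3.8), Sigma of (3.10))."""
    first = all(
        eta_squared(d, gamma_cap, n) >= 273 * gamma_cap * d_m / n for d in deltas
    )
    lhs_sum = sum(
        math.exp(-n * eta_squared(d, gamma_cap, n) / (1365 * gamma_cap)) for d in deltas
    )
    sigma = sum(math.exp(-d) for d in deltas)
    return first, lhs_sum, sigma


first_ok, lhs_sum, sigma = check_conditions([0.1, 1.0, 2.0], gamma_cap=3.0, n=100)
too_small_ok, _, _ = check_conditions([0.05], gamma_cap=3.0, n=100)
```

Since $37^2 = 1369$, the first condition reduces to $1369 \Delta_m \ge 136.5$, which explains the requirement $\Delta_m \ge 1/10$, and the second holds because $1369/1365 \ge 1$.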

At this stage, there are two main difficulties in applying Theorem 3 or Corollary 3. The first problem is to build suitable subsets $S_m$ (or $S$) of $\overline{\mathbb L}_\infty^\Gamma$ from classical approximating sets (models), finite-dimensional linear spaces for instance, that belong to $\mathbb L_2(\mu)$. We shall address this problem in the next section while we shall solve the second problem, namely choosing a convenient value for $\Gamma$ from the data, in Section 5.



4 Model selection with uniformly bounded models 



4.1 The projection operator onto $\overline{\mathbb{L}}_\infty^\Gamma$

Our first task is to define a projection operator $\pi_\Gamma$ from $\mathbb{L}_2(\mu)$ onto $\overline{\mathbb{L}}_\infty^\Gamma$ ($\Gamma > 1$) and
to study its properties. In the sequel, we systematically identify a real number $a$ with
the function $a\mathbf{1}_{\mathcal{X}}$ for the sake of simplicity. The following proposition is the corrected
version, by Yannick Baraud, of the initial mistaken result of the author.

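Proposition 4 For $t \in \mathbb{L}_2(\mu)$ and $1 < \Gamma < +\infty$ we set $\pi_\Gamma(t) = [(t + \gamma) \vee 0] \wedge \Gamma$ where
$\gamma$ is defined by $\int [(t + \gamma) \vee 0] \wedge \Gamma \, d\mu = 1$. Then $\pi_\Gamma$ is the projection operator from
$\mathbb{L}_2(\mu)$ onto the convex set $\overline{\mathbb{L}}_\infty^\Gamma$. Moreover, if $s \in \overline{\mathbb{L}}_2$ and $\Gamma > 2$, then

\[ \|s - \pi_\Gamma(s)\|^2 \le \frac{\Gamma^2 - \Gamma - 1}{\Gamma(\Gamma - 2)}\, Q_s(\Gamma) \quad \text{with} \quad Q_s(z) = \int_{s > z} (s - z)^2 \, d\mu. \tag{4.1} \]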

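Proof: First note that the existence of $\gamma$ follows from the continuity and monotonicity
of the mapping $z \mapsto \int [(t + z) \vee 0] \wedge \Gamma \, d\mu$ and that $\pi_\Gamma(t) \in \overline{\mathbb{L}}_\infty^\Gamma$. Since $\overline{\mathbb{L}}_\infty^\Gamma$ is a closed
convex subset of a Hilbert space, the projection operator $\pi$ onto $\overline{\mathbb{L}}_\infty^\Gamma$ exists and is
characterized by the fact that

\[ \langle t - \pi(t), u - \pi(t) \rangle \le 0 \quad \text{for all } u \in \overline{\mathbb{L}}_\infty^\Gamma. \tag{4.2} \]

Since $\int [u - \pi(t)] \, d\mu = 0$ for $u \in \overline{\mathbb{L}}_\infty^\Gamma$, (4.2) implies that $\langle t + z - \pi(t), u - \pi(t) \rangle \le 0$
for $z \in \mathbb{R}$, hence $\pi(t) = \pi(t + z)$. Since $\pi_\Gamma(t) = \pi_\Gamma(t + z)$ as well, we may assume
that $\int [t \vee 0] \wedge \Gamma \, d\mu = 1$, hence $\pi_\Gamma(t) = [t \vee 0] \wedge \Gamma$ and $\pi_\Gamma(t) = t$ on the set $0 \le t \le \Gamma$.
Then, for $u \in \overline{\mathbb{L}}_\infty^\Gamma$,

\[ \langle t - \pi_\Gamma(t), u - \pi_\Gamma(t) \rangle = \int_{t < 0} t u \, d\mu + \int_{t > \Gamma} (t - \Gamma)(u - \Gamma) \, d\mu \le 0, \]

since $0 \le u \le \Gamma$. This concludes the proof that $\pi = \pi_\Gamma$.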
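Let us now bound $\|s - \pi_\Gamma(s)\|$ when $s \in \overline{\mathbb{L}}_2$, setting $s = s \wedge \Gamma + v$ with $v = (s - \Gamma)\mathbf{1}_{s > \Gamma}$.
Since there is nothing to prove when $\|s\|_\infty \le \Gamma$, we assume that $\int v \, d\mu > 0$.
By the Cauchy-Schwarz inequality,

\[ \left( \int v \, d\mu \right)^2 \le \mu(\{s > \Gamma\}) \int v^2 \, d\mu \le \Gamma^{-1} \|v\|^2. \]

Moreover, since $\int s \wedge \Gamma \, d\mu < 1$, $\pi_\Gamma(s) = (s + \gamma) \wedge \Gamma$ with $0 < \gamma < 1$. Hence

\[ 1 = \int [(s + \gamma) \wedge \Gamma] \, d\mu \ge \int (s \wedge \Gamma) \, d\mu + \gamma \mu(\{s \le \Gamma - \gamma\}) \ge 1 - \int v \, d\mu + \gamma \left( 1 - \frac{1}{\Gamma - \gamma} \right) \ge 1 - \int v \, d\mu + \gamma\, \frac{\Gamma - 2}{\Gamma - 1} \tag{4.3} \]

and $\gamma \le (\Gamma - 1)/(\Gamma - 2) \int v \, d\mu$. Now, since $0 \le \pi_\Gamma(s) - s \le \gamma$ for $s < \Gamma$,

\begin{align*}
\|s - \pi_\Gamma(s)\|^2 &= \int_{s < \Gamma} [\pi_\Gamma(s) - s]^2 \, d\mu + \|v\|^2 \le \gamma \int_{s < \Gamma} [\pi_\Gamma(s) - s] \, d\mu + \|v\|^2 \\
&\le \frac{\Gamma - 1}{\Gamma - 2} \left( \int v \, d\mu \right) \int_{s > \Gamma} [s - \pi_\Gamma(s)] \, d\mu + \|v\|^2 \\
&\le \frac{\Gamma - 1}{\Gamma - 2} \left( \int v \, d\mu \right)^2 + \|v\|^2 \le \left[ 1 + \frac{\Gamma - 1}{\Gamma(\Gamma - 2)} \right] \|v\|^2,
\end{align*}

where we used (4.3). This concludes our proof. $\square$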
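As an illustration of Proposition 4, the following minimal sketch computes a discrete analogue of $\pi_\Gamma$ when $\mu$ is a probability measure carried by finitely many atoms. It is not part of the original construction: the function name `project_onto_capped_densities`, the bisection search for the shift $\gamma$, and the four-atom example are assumptions made for the illustration only.

```python
def clip(value, low, high):
    """Clamp a number to the interval [low, high]."""
    return max(low, min(high, value))


def project_onto_capped_densities(t, weights, cap, iterations=200):
    """Discrete analogue of pi_Gamma(t) = [(t + shift) v 0] ^ cap.

    t       : values of the function on each atom
    weights : mu-masses of the atoms (assumed positive, summing to 1)
    cap     : the bound Gamma > 1
    The shift is found by bisection, using that the total mass of the
    clipped function is continuous and nondecreasing in the shift.
    """
    def mass(shift):
        return sum(w * clip(v + shift, 0.0, cap) for v, w in zip(t, weights))

    low = -max(t)        # clipped function is identically 0: mass 0 < 1
    high = cap - min(t)  # clipped function is identically cap: mass cap > 1
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if mass(mid) < 1.0:
            low = mid
        else:
            high = mid
    shift = 0.5 * (low + high)
    return [clip(v + shift, 0.0, cap) for v in t], shift


# Hypothetical example: a "spiky" density on four atoms of mass 1/4 each.
weights = [0.25, 0.25, 0.25, 0.25]
s = [3.4, 0.2, 0.2, 0.2]          # integrates to 1 against the weights
cap = 3.0
projected, shift = project_onto_capped_densities(s, weights, cap)

# Left-hand side of (4.1) and its upper bound.
squared_distance = sum(w * (a - b) ** 2 for a, b, w in zip(s, projected, weights))
q_s = sum(w * max(a - cap, 0.0) ** 2 for a, w in zip(s, weights))
bound = (cap ** 2 - cap - 1.0) / (cap * (cap - 2.0)) * q_s
```

In this toy case the excess mass above the cap is spread evenly over the remaining atoms, and the squared distance stays below the right-hand side of (4.1), as the proposition predicts for $\Gamma > 2$.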

4.2 Selection for uniformly bounded countable sets 

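We consider here the situation mentioned in Corollary 3 but without the assumption
that $S \subset \overline{\mathbb{L}}_\infty^\Gamma$. For $S = \{t_m\}_{m \in \mathcal{M}}$ an arbitrary countable subset of $\mathbb{L}_2(\mu)$ we may
always replace it by its projection $\pi_\Gamma(S)$ onto $\overline{\mathbb{L}}_\infty^\Gamma$ and apply Corollary 3. The resulting
risk bound involves

\[ d_2(s, \pi_\Gamma(t_m)) \le d_2(s, \pi_\Gamma(s)) + d_2(\pi_\Gamma(s), \pi_\Gamma(t_m)) \le \left[ \frac{\Gamma^2 - \Gamma - 1}{\Gamma(\Gamma - 2)}\, Q_s(\Gamma) \right]^{1/2} + d_2(s, t_m) \]

by Proposition 4. We finally get: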



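Corollary 4 Assume that we observe n i.i.d. random variables with unknown density
$s \in (\overline{\mathbb{L}}_2, d_2)$ and that we have at disposal a countable subset $S = \{t_m\}_{m \in \mathcal{M}}$ of $\mathbb{L}_2(\mu)$
and a family of weights $\{\Delta_m\}_{m \in \mathcal{M}}$ such that $\Delta_m \ge 1/10$ for all $m \in \mathcal{M}$ and (3.10)
holds. Given $\Gamma \ge 3$ we can build a T-estimator $\hat{s}^\Gamma$ with values in $\pi_\Gamma(S)$ such that, for
all $s \in \overline{\mathbb{L}}_2$,

\[ \mathbb{E}_s\left[\|s - \hat{s}^\Gamma\|^q\right] \le C_q \Sigma \inf_{m \in \mathcal{M}} \left\{ \left[ d_2(s, t_m) + \sqrt{Q_s(\Gamma)} \right] \vee \sqrt{\Gamma \Delta_m / n} \right\}^q \quad \text{for } q \ge 1. \]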
4.3 Selection with uniformly bounded models

Typical models $S$ for density estimation in $\mathbb{L}_2(\mu)$ are finite-dimensional linear spaces
which are not subsets of $\overline{\mathbb{L}}_\infty^\Gamma$ but merely spaces of functions with nice approximation
properties. To apply Theorem 3 we have to replace them by discrete subsets of $\overline{\mathbb{L}}_\infty^\Gamma$
that satisfy (3.7). Unfortunately, they cannot simply be derived by a discretization
of $S$ followed by a projection $\pi_\Gamma$ or a discretization of $\pi_\Gamma(S)$. A more complicated
construction is required to preserve both the metric and approximation properties of
$S$. It is provided by the following preliminary result.

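Proposition 5 Let $S$ be a subset of $\mathbb{L}_2(\mu)$ with metric dimension bounded by $D$. For
$\Gamma > 2$ and $\eta > 0$, one can find a discrete subset $S'$ of $\overline{\mathbb{L}}_\infty^\Gamma$ with the following properties:

\[ |S' \cap \mathcal{B}_{d_2}(t, x\eta)| \le \exp\left[9Dx^2\right] \quad \text{for all } x \ge 2 \text{ and } t \in \mathbb{L}_2(\mu); \tag{4.4} \]

for any $s \in \overline{\mathbb{L}}_2$, one can find some $s' \in S'$ such that

\[ \|s - s'\| \le 3.1 \left[ \eta + \inf_{t \in S} \|s - t\| \right] + 4.1 \left( \frac{\Gamma^2 - \Gamma - 1}{\Gamma(\Gamma - 2)}\, Q_s(\Gamma) \right)^{1/2}. \tag{4.5} \]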

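Proof: According to Definition 1 we choose some $\eta$-net $S_\eta$ for $S$ such that (2.2) holds
for all $t \in \mathbb{L}_2(\mu)$. Since, by Proposition 4, the operator $\pi_\Gamma$ from $\mathbb{L}_2(\mu)$ to $\overline{\mathbb{L}}_\infty^\Gamma$ satisfies
$\|u - \pi_\Gamma(t)\| \le \|u - t\|$ for all $u \in \overline{\mathbb{L}}_\infty^\Gamma$, we may apply Proposition 12 of Birgé (2006a)
with $M' = \mathbb{L}_2(\mu)$, $d = d_2$, $M_0 = \overline{\mathbb{L}}_\infty^\Gamma$, $T = S_\eta$, $\pi = \pi_\Gamma$ and $\lambda = 1$. It shows that one
can find a subset $S'$ of $\pi_\Gamma(S_\eta)$ such that (4.4) holds and $d_2(u, S') \le 3.1\, d_2(u, S_\eta)$ for
all $u \in \overline{\mathbb{L}}_\infty^\Gamma$. If $s$ is an arbitrary element of $\overline{\mathbb{L}}_2$, then

\[ d_2(\pi_\Gamma(s), S') \le 3.1\, d_2(\pi_\Gamma(s), S_\eta) \le 3.1 \left[ d_2(\pi_\Gamma(s), s) + d_2(s, S) + \eta \right], \]

hence

\[ d_2(s, S') \le 3.1 \left[ d_2(s, S) + \eta \right] + 4.1\, d_2(\pi_\Gamma(s), s). \tag{4.6} \]

The conclusion follows from Proposition 4. $\square$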

We are now in a position to derive our main result about bounded model selection.
We start with a countable collection $\{S_m, m \in \mathcal{M}\}$ of models in $\mathbb{L}_2(\mu)$ with metric
dimensions bounded respectively by $D_m \ge 1/2$ and a family of weights $\Delta_m$ satisfying
(3.10). We fix some $\Gamma \ge 3$ and, for each $m \in \mathcal{M}$, we set

\[ \eta_m = \left( 50\sqrt{D_m} \vee 37\sqrt{\Delta_m} \right) \sqrt{\Gamma / n}. \]

By Proposition 5 (with $\eta = \eta_m$), each $S_m$ gives rise to a subset $S'_m$ which satisfies
(3.7) with $D'_m = 9D_m$. It follows from our choice of $\eta_m$ that (3.8) is also satisfied so
that we may apply Theorem 3 to the family of sets $\{S'_m, m \in \mathcal{M}\}$. This results in a
T-estimator $\hat{s}^\Gamma$ such that, for all $s \in \overline{\mathbb{L}}_2$,

\[ \mathbb{E}_s\left[d_2^q(s, \hat{s}^\Gamma)\right] \le C_q \Sigma \inf_{m \in \mathcal{M}} \left\{ d_2(s, S'_m) \vee \eta_m \right\}^q \quad \text{for } q \ge 1. \]
We also derive from Proposition 5 that

\[ d_2(s, S'_m) \le 3.1 \left[ \eta_m + \inf_{t \in S_m} \|s - t\| \right] + 4.1 \sqrt{(5/3)\, Q_s(\Gamma)}, \]

since $(\Gamma^2 - \Gamma - 1)/[\Gamma(\Gamma - 2)] \le 5/3$ for $\Gamma \ge 3$.
Putting the bounds together and rearranging the terms leads to the following theorem.

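Theorem 4 Given a countable collection $\{S_m, m \in \mathcal{M}\}$ of models in $\mathbb{L}_2(\mu)$ with
metric dimensions bounded respectively by $D_m \ge 1/2$ and a family of weights $\Delta_m$
satisfying (3.10), one can build, for each $\Gamma \ge 3$, a T-estimator $\hat{s}^\Gamma$ which satisfies, for
all $s \in \overline{\mathbb{L}}_2$ and $q \ge 1$,

\[ \mathbb{E}_s\left[\|s - \hat{s}^\Gamma\|^q\right] \le C_q \Sigma \left[ \inf_{m \in \mathcal{M}} \left\{ d_2(s, S_m) + \sqrt{\frac{\Gamma (D_m \vee \Delta_m)}{n}} \right\} + \sqrt{Q_s(\Gamma)} \right]^q, \tag{4.7} \]

with $Q_s$ given by (4.1) and $C_q$ some constant depending only on $q$. If $\|s\|_\infty \le \Gamma$, then

\[ \mathbb{E}_s\left[\|s - \hat{s}^\Gamma\|^2\right] \le C \Sigma \inf_{m \in \mathcal{M}} \left\{ d_2^2(s, S_m) + n^{-1} \Gamma (D_m \vee \Delta_m) \right\}. \tag{4.8} \]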



5 A selection theorem 

Now that we are able to build models which are bounded by $\Gamma$ for each $\Gamma \ge 3$ and to
select one of these models, which results in an estimator $\hat{s}^\Gamma$, we need a way to choose
$\Gamma$ from the data in order to optimize the bound in (4.7). The idea is to use one half
of our sample to build a sequence of estimators $\hat{s}^{2^{i+1}}$, $i \ge 1$, and select a convenient value of $i$
from the other half of our sample. This requires selecting an element from a sequence
of densities which is not uniformly bounded.



5.1 A preliminary selection result 

We start with a general selection result, to be proved in Section 7.3, that we state
for an arbitrary statistical framework since it may apply to other situations than
density estimation from an i.i.d. sample. We observe some random object $X$ with
distribution $P_s$ on $\mathcal{X}$ where $s$ belongs to a metric space $M$ (carrying a distance $d$)
which indexes a family $\mathcal{P} = \{P_t, t \in M\}$ of probabilities on $\mathcal{X}$.

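Theorem 5 Let $(t_p)_{p \ge 1}$ be a sequence in $M$ such that the following assumption holds:
for all pairs $(n, p)$ with $1 \le n < p$ and all $x \in \mathbb{R}$, one can find a test $\psi_{t_n, t_p, x}$ based on
the observation $X$ and satisfying

\[ \sup_{\{s \in M \,|\, d(s, t_n) \le d(t_n, t_p)/4\}} P_s\left[\psi_{t_n, t_p, x}(X) = t_p\right] \le B \exp\left[-a 2^{-p} d^2(t_n, t_p) - x\right]; \tag{5.1} \]

\[ \sup_{\{s \in M \,|\, d(s, t_p) \le d(t_n, t_p)/4\}} P_s\left[\psi_{t_n, t_p, x}(X) = t_n\right] \le B \exp\left[-a 2^{-p} d^2(t_n, t_p) + x\right]; \tag{5.2} \]

with positive constants $a$ and $B$ independent of $n$, $p$ and $x$. For each $A \ge 1$, one can
design an estimator $\hat{s}_A$ such that, for all $s \in M$,

\[ \mathbb{E}_s\left[d^q(\hat{s}_A, s)\right] \le B\, C(A, q) \inf_{p \ge 1} \left[ d(s, t_p) \vee \sqrt{a^{-1} p\, 2^p} \right]^q \quad \text{for } 1 \le q < 2A/\log 2. \tag{5.3} \]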

This general result applies to our specific framework of density estimation based on
an observation $X$ with distribution $P_s$, $s \in \overline{\mathbb{L}}_2$, provided that the sequence $(t_p)_{p \ge 1}$
be suitably chosen. We shall simply assume here that $t_p \in \overline{\mathbb{L}}_2$ with $\|t_p\|_\infty \le 2^{p+1}$
for each $p \ge 1$. This implies that, for $1 \le i < j$, $t_i$ and $t_j$ belong to $\overline{\mathbb{L}}_\infty^{2^{j+1}}$ so that
Theorem 2 applies with $X$ replaced by the randomized sample $X'$ and the assumption
of Theorem 5 is therefore satisfied with $d = d_2$, $B = 1$ and $a = n/65$, leading to the
following corollary.



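Corollary 5 Let $(t_i)_{i \ge 1}$ be a sequence of densities such that $t_i \in \overline{\mathbb{L}}_\infty^{2^{i+1}}$ for each $i$,
$A \ge 1$ and $X$ be an n-sample with density $s \in \overline{\mathbb{L}}_2$. One can design an estimator $\hat{s}_A$
such that

\[ \mathbb{E}_s\left[d_2^q(\hat{s}_A, s)\right] \le C(A, q) \inf_{i \ge 1} \left[ d_2(s, t_i) \vee \sqrt{n^{-1} i\, 2^i} \right]^q \quad \text{for } 1 \le q < 2A/\log 2. \]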



5.2 General model selection in L2 

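We now consider the general situation where we observe $n = 2n'$ i.i.d. random variables
$X_1, \ldots, X_n$ with an unknown density $s \in \overline{\mathbb{L}}_2$, not necessarily bounded, and
have at disposal a countable collection $\{S_m, m \in \mathcal{M}\}$ of models in $\mathbb{L}_2(\mu)$ with metric
dimensions bounded respectively by $D_m \ge 1/2$ and a family of weights $\Delta_m$ which
satisfy (3.10). We split our sample $X = (X_1, \ldots, X_n)$ into two subsamples $X_1$ and
$X_2$ of size $n'$. With the sample $X_1$ we build the T-estimators $\hat{s}_i(X_1) = \hat{s}^{2^{i+1}}(X_1)$,
$i \ge 1$, which are provided by Theorem 4. It then follows from (4.7) that each such
estimator satisfies, for $q \ge 1$,

\[ \mathbb{E}_s\left[\|s - \hat{s}_i(X_1)\|^q\right] \le C_q \Sigma \left[ \inf_{m \in \mathcal{M}} \left\{ d_2(s, S_m) + \left( \frac{2^{i+1}(D_m \vee \Delta_m)}{n} \right)^{1/2} \right\} + \sqrt{Q_s(2^{i+1})} \right]^q, \]

with $Q_s$ given by (4.1). We now work conditionally on $X_1$, fix a convenient value
of $A \ge 1$ (for instance $A = 1$ if we just want to bound the quadratic risk) and use
the second half of the sample $X_2$ to select one estimator among the previous family
according to the procedure described in Section 5.1. By Corollary 5 this results in a
new estimator $\hat{s}_A(X)$ which satisfies

\[ \mathbb{E}_s\left[d_2^q(\hat{s}_A(X), s) \,\middle|\, X_1\right] \le C(A, q) \inf_{i \ge 1} \left[ d_2(s, \hat{s}_i(X_1)) \vee \sqrt{n^{-1} i\, 2^i} \right]^q \]

provided that $q < 2A/\log 2$. Integrating with respect to $X_1$ and using our previous
risk bound gives

\begin{align*}
\mathbb{E}_s\left[\|s - \hat{s}_A(X)\|^q\right] &\le C(A, q) \inf_{i \ge 1} \left\{ \mathbb{E}_s\left[\|s - \hat{s}_i(X_1)\|^q\right] + \left[ n^{-1} i\, 2^i \right]^{q/2} \right\} \\
&\le C'(A, q)\, \Sigma \inf_{i \ge 1} \left\{ \inf_{m \in \mathcal{M}} \left[ d_2^2(s, S_m) + \frac{2^{i+1}(D_m \vee \Delta_m \vee i)}{n} \right]^{q/2} + \left[ Q_s(2^{i+1}) \right]^{q/2} \right\}.
\end{align*}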
For $2^i \le z < 2^{i+1}$, $\log z \ge i \log 2$ and $Q_s(z) \ge Q_s(2^{i+1})$ since $Q_s$ is nonincreasing.
Modifying accordingly the constants in our bounds, we get the main result of this
paper which provides adaptation to both the models and the truncation constant.

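Theorem 6 Let $X = (X_1, \ldots, X_n)$ with $n \ge 2$ be an i.i.d. sample with unknown
density $s \in \overline{\mathbb{L}}_2$ and $\{S_m, m \in \mathcal{M}\}$ be a countable collection of models in $\mathbb{L}_2(\mu)$ with
metric dimensions bounded respectively by $D_m \ge 1/2$. Let $\{\Delta_m, m \in \mathcal{M}\}$ be a family
of weights which satisfy (3.10) and $Q_s(z)$ be given by (4.1). For each $A \ge 1$, there
exists an estimator $\hat{s}_A(X)$ such that, whatever $s \in \overline{\mathbb{L}}_2$ and $1 \le q < (2A/\log 2)$,

\[ \mathbb{E}_s\left[\|s - \hat{s}_A(X)\|^q\right] \le C(A, q)\, \Sigma \inf_{z \ge 2} \left\{ \inf_{m \in \mathcal{M}} \left[ d_2^2(s, S_m) + \frac{z (D_m \vee \Delta_m \vee \log z)}{n} \right]^{q/2} + \left[ Q_s(z) \right]^{q/2} \right\}. \tag{5.4} \]

In particular, for $\hat{s} = \hat{s}_1$ and $s \in \mathbb{L}_\infty(\mu)$,

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \Sigma \inf_{m \in \mathcal{M}} \left[ d_2^2(s, S_m) + n^{-1} \|s\|_\infty (D_m \vee \Delta_m \vee \log \|s\|_\infty) \right]. \tag{5.5} \]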



5.3 Some remarks 



We see that (5.4) is a generalization of (4.7) and (5.5) of (4.8) at the modest price
of the extra $\log z$ (or $\log \|s\|_\infty$). We do not know whether this $\log z$ is necessary or
not but, in a typical model selection problem, when $s$ belongs to $\mathbb{L}_\infty(\mu)$ but not to
$\cup_{m \in \mathcal{M}} S_m$, the optimal value of $D_m$ goes to $+\infty$ with $n$, so that, for this optimal
value, asymptotically $D_m \vee \Delta_m \vee \log \|s\|_\infty = D_m \vee \Delta_m$.

Up to constants depending on $\|s\|_\infty$, (5.5) is the exact analogue of (2.5) which
shows that, when $s \in \mathbb{L}_\infty(\mu)$, all the results about model selection obtained for the
Hellinger distance can be translated in terms of the $\mathbb{L}_2$-distance.

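Note that Theorem 6 applies to a single model $S$ with metric dimension bounded
by $D$, in which case one can use a weight $\Delta = 1/2 \le D$ which results, if $A = 1$, in
the risk bound

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \left[ d_2^2(s, S) + \inf_{z \ge 2} \left\{ \frac{z (D \vee \log z)}{n} + Q_s(z) \right\} \right] \tag{5.6} \]

and, if $s \in \mathbb{L}_\infty(\mu)$,

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \left[ d_2^2(s, S) + n^{-1} \|s\|_\infty (D \vee \log \|s\|_\infty) \right]. \tag{5.7} \]

Apart from the extra $\log \|s\|_\infty$, which is harmless when it is smaller than $D$, we
recover what we expected, namely the bound (2.8).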

Even if $s \in \mathbb{L}_\infty(\mu)$ the bound (5.4) may be much better than (5.5). This is actually
already visible with one single model, comparing (5.6) with (5.7). It is indeed easy
to find an example of a very spiky density $s$ for which (5.6) is much better than (5.7)
or the classical bound (2.6) obtained for projection estimators. Of course, this is just
a comparison of universal bounds, not of the real risk of estimators for a given $s$.

More surprising is the fact that our estimator can actually dominate a histogram
based on the same model, although our counter-example is rather caricatural and
more an advertisement against the use of the $\mathbb{L}_2$-loss than against the use of histogram
estimators. Let us consider a partition $\mathcal{I}$ of $[0, 1]$ into $2D$ intervals $I_j$, $1 \le j \le 2D$, with
the integer $D$ satisfying $2 \le D \le n$ and fix some $\gamma \ge 10$. We then set $\alpha = (\gamma^2 n)^{-1}$.
For $1 \le j \le D$, the intervals $I_{2j-1}$ have length $\alpha$ while the intervals $I_{2j}$ have length
$\beta$ with $D(\alpha + \beta) = 1$. We denote by $S$ the $2D$-dimensional linear space spanned by
the indicator functions of the $I_j$. It is a model with metric dimension bounded by $D$.
We assume that the underlying density $s$ with respect to Lebesgue measure belongs
to $S$ and is defined as

\[ s = p\alpha^{-1} \sum_{j=1}^{D} \mathbf{1}_{I_{2j-1}} + q\beta^{-1} \sum_{j=1}^{D} \mathbf{1}_{I_{2j}} \quad \text{with } p = \gamma\alpha \text{ and } D(p + q) = 1, \]

so that $\beta > q$ since $\alpha < p$. We consider two estimators of $s$ derived from the same
model $S$: the histogram $\hat{s}_{\mathcal{I}}$ based on the partition $\mathcal{I}$ and the estimator $\hat{s}$ based on $S$
and provided by Theorem 6. According to (1.2) the risk of $\hat{s}_{\mathcal{I}}$ is

\[ D n^{-1} \left[ \alpha^{-1} p(1 - p) + \beta^{-1} q(1 - q) \right] \ge 0.9\, D n^{-1} \alpha^{-1} p = 0.9\, D \gamma n^{-1}, \]

since $p \le 1/10$. The risk of $\hat{s}$ can be bounded by (5.4) with $z = 4$ which gives

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \left[ 4 D n^{-1} + D \int_{I_1} (p/\alpha)^2 \, d\mu \right] = C D \left[ 4 n^{-1} + p^2 \alpha^{-1} \right] = 5 C D n^{-1}. \]

For large enough values of $\gamma$ our estimator is better than the histogram. The problem
actually comes from the observations falling in some of the intervals $I_{2j-1}$ which will
lead to a very bad estimation of $s$ on those intervals. Note that this fact will happen
with a small probability since $Dp = D(\gamma n)^{-1} \le \gamma^{-1}$. Nevertheless, this event of small
probability is important enough to lead to a large risk when we use the $\mathbb{L}_2$-loss.
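The arithmetic of this counter-example can be checked numerically. The sketch below is only an illustration under stated assumptions: the function name `counterexample_bounds` and the sample values of $n$, $D$ and $\gamma$ are hypothetical, and because the universal constant $C$ is not specified, only the bracketed quantity of the T-estimator bound is computed, not the bound itself.

```python
import math


def counterexample_bounds(n, D, gamma):
    """Return (histogram risk, bracket of the T-estimator bound with z = 4).

    The histogram risk is the exact variance term of (1.2); there is no
    bias because s belongs to the model S. The second value is
    z (D v log z) / n + Q_s(z) for z = 4, i.e. the bound without the
    unspecified universal constant C.
    """
    alpha = 1.0 / (gamma ** 2 * n)   # length of the odd-numbered intervals
    beta = 1.0 / D - alpha           # length of the even-numbered intervals
    p = gamma * alpha                # mass of each odd-numbered interval
    q = 1.0 / D - p                  # mass of each even-numbered interval

    histogram_risk = D / n * (p * (1 - p) / alpha + q * (1 - q) / beta)

    z = 4.0
    variance_term = z * max(D, math.log(z)) / n
    # Only the spikes, of height p / alpha = gamma >= 10, exceed z = 4;
    # the density equals q / beta < 1 on the even-numbered intervals.
    q_s = D * alpha * (p / alpha - z) ** 2
    return histogram_risk, variance_term + q_s


# Hypothetical values, chosen only to make the orders of magnitude visible.
n, D, gamma = 1000, 10, 100.0
histogram_risk, bracket = counterexample_bounds(n, D, gamma)
```

With these values the histogram risk is of order $D\gamma/n$ while the bracket is at most $5D/n$, so the comparison is favourable to the T-estimator once $\gamma$ exceeds a multiple of the constant $C$.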



6 Some applications 

6.1 Aggregation of preliminary estimators 

Theorem 6 applies in particular to the problem of aggregating preliminary estimators,
built from an independent sample, either by selecting one of them or by combining
them linearly.

6.1.1 Aggregation by selection 

Let us begin with the problem, that we already considered in Section 4.2, of selecting
a point among a countable family $\{t_m, m \in \mathcal{M}\}$. Typically, as in Rigollet (2006), the
$t_m$ are preliminary estimators based on an independent sample (derived by sample
splitting if necessary) and we want to choose the best one in the family. This is a
situation for which one can choose $D_m = 1/2$ and $A = 1$ which leads to the following
corollary.

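Corollary 6 Let $X = (X_1, \ldots, X_n)$ with $n \ge 2$ be an i.i.d. sample with unknown
density $s \in \overline{\mathbb{L}}_2$ and $\{t_m, m \in \mathcal{M}\}$ be a countable collection of points in $\mathbb{L}_2(\mu)$. Let
$\{\Delta_m, m \in \mathcal{M}\}$ be a family of weights which satisfy (3.10) and $Q_s(z)$ be given by (4.1).
There exists an estimator $\hat{s}(X)$ such that, whatever $s \in \overline{\mathbb{L}}_2$,

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \Sigma \inf_{z \ge 2} \left\{ \inf_{m \in \mathcal{M}} \left[ d_2^2(s, t_m) + \frac{z (\Delta_m \vee \log z)}{n} \right] + Q_s(z) \right\}. \]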



6.1.2 Linear aggregation 

Rigollet and Tsybakov (2007) have considered the problem of linear aggregation.
Given a finite set $\{t_1, \ldots, t_N\}$ of preliminary estimators of $s$, they use the observations
to build a linear combination of the $t_j$ in order to get a new and potentially better
estimator of $s$. For $\lambda = (\lambda_1, \ldots, \lambda_N) \in \mathbb{R}^N$, let us set $t_\lambda = \sum_{j=1}^{N} \lambda_j t_j$. Rigollet
and Tsybakov build a selector $\hat{\lambda}(X_1, \ldots, X_n)$ such that the corresponding estimator
$\hat{s}(X) = t_{\hat{\lambda}}$ satisfies, for all $s \in \overline{\mathbb{L}}_\infty$,

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le \inf_{\lambda \in \mathbb{R}^N} d_2^2(s, t_\lambda) + n^{-1} \|s\|_\infty N. \tag{6.1} \]

Unfortunately, this bound, which is shown to be sharp for such an estimator, can
be really poor, as compared to the minimal risk $\inf_{1 \le j \le N} d_2^2(s, t_j)$ of the preliminary
estimators when one of those is already quite good and $n^{-1} \|s\|_\infty N$ is large, which
is likely to happen when $N$ is quite large. Moreover, this result tells nothing when
$s \notin \mathbb{L}_\infty$. In Birgé (2006a, Section 9.3) we proposed an alternative way of selecting
a linear combination of the $t_j$ based on T-estimators. In the particular situation of
densities belonging to $\mathbb{L}_2$, we proceed as follows: we choose for $\mathcal{M}$ the collection of
all nonvoid subsets $m$ of $\{1, \ldots, N\}$ and, for $m \in \mathcal{M}$, we take for $S_m$ the linear span
of the $t_j$ with $j \in m$ so that the dimension of $S_m$ is bounded by $|m|$ and its metric
dimension by $|m|/2$. Since the number of elements of $\mathcal{M}$ with cardinality $j$ is
$\binom{N}{j} < (eN/j)^j$, we may set $\Delta_m = |m| [2 + \log(N/|m|)]$ so that (3.10) is satisfied
with $\Sigma < 1$. An application of Theorem 6 leads to the following corollary.
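The claim that these weights give $\Sigma < 1$ follows from $\binom{N}{j} e^{-\Delta_m} \le (eN/j)^j e^{-2j} (j/N)^j = e^{-j}$ for $|m| = j$, so that $\Sigma \le \sum_{j \ge 1} e^{-j} = 1/(e-1)$. The short sketch below evaluates the sum numerically; the function name `weight_sum` and the tested values of $N$ are illustrative assumptions, and log-gamma functions are used only to avoid overflow of the binomial coefficients.

```python
import math


def weight_sum(N):
    """Sum of exp(-Delta_m) over all nonvoid subsets m of {1, ..., N},
    with Delta_m = |m| * (2 + log(N / |m|)).

    Subsets are grouped by cardinality j, each group contributing
    binom(N, j) * exp(-j * (2 + log(N / j))).
    """
    total = 0.0
    for j in range(1, N + 1):
        log_binomial = (math.lgamma(N + 1) - math.lgamma(j + 1)
                        - math.lgamma(N - j + 1))
        delta = j * (2.0 + math.log(N / j))
        total += math.exp(log_binomial - delta)
    return total


# Values of N chosen arbitrarily for the illustration.
sums = {N: weight_sum(N) for N in (1, 5, 50, 500, 5000)}
upper_bound = 1.0 / (math.e - 1.0)   # value of sum_{j >= 1} exp(-j)
```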

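Corollary 7 Let $X = (X_1, \ldots, X_n)$ with $n \ge 2$ be an i.i.d. sample with unknown
density $s \in \overline{\mathbb{L}}_2$ and $\{t_1, \ldots, t_N\}$ be a finite set of points in $\mathbb{L}_2(\mu)$. Let $\mathcal{M}$ be the
collection of all nonvoid subsets $m$ of $\{1, \ldots, N\}$ and, for $m \in \mathcal{M}$,

\[ \Lambda_m = \left\{ \lambda \in \mathbb{R}^N \,\middle|\, \lambda_j = 0 \text{ for } j \notin m \right\}. \]

For each $A \ge 1$, there exists an estimator $\hat{s}_A(X)$ such that, whatever $s \in \overline{\mathbb{L}}_2$ and
$1 \le q < (2A/\log 2)$,

\[ \mathbb{E}_s\left[\|s - \hat{s}_A(X)\|^q\right] \le C(A, q) \inf_{z \ge 2} \inf_{m \in \mathcal{M}} R(q, s, z, m) \]

where

\[ R(q, s, z, m) = \left[ \inf_{\lambda \in \Lambda_m} d_2^2(s, t_\lambda) + \frac{z \left[ |m| (1 + \log(N/|m|)) \vee \log z \right]}{n} \right]^{q/2} + \left[ Q_s(z) \right]^{q/2} \]

and $Q_s(z)$ is given by (4.1).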
There are many differences between this bound and (6.1), apart from the nasty constant
$C(A, q)$. Firstly, it applies to densities $s$ that do not belong to $\mathbb{L}_\infty$ and handles
the case of $q > 2$ for a convenient choice of $A$. Also, when $s \in \mathbb{L}_\infty$ and one of the
preliminary estimators is already close to $s$, it may very well happen, when $N$ is large,
that

\[ R(2, s, \|s\|_\infty, m) = \inf_{\lambda \in \Lambda_m} d_2^2(s, t_\lambda) + n^{-1} \|s\|_\infty \left[ |m| (1 + \log(N/|m|)) \vee \log \|s\|_\infty \right] \]

be much smaller than the right-hand side of (6.1) for some $m$ of small cardinality.



6.2 Selection of projection estimators 



In this section, we assume that $s \in \mathbb{L}_\infty(\mu)$. This assumption is not needed for the
design of the estimator but only to derive suitable risk bounds. We have at hand a
countable family $\{S_m, m \in \mathcal{M}\}$ of linear subspaces of $\mathbb{L}_2(\mu)$ with respective dimensions
$D_m$ and we choose corresponding weights $\Delta_m$ satisfying (3.10). For each $m$, we
consider the projection estimator $\hat{s}_m$ defined in Section 2.2. Each such estimator has
a risk bounded by (2.6), i.e.



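where $\bar{s}_m$ denotes the orthogonal projection of $s$ onto $S_m$. If we apply Corollary 6 to
this family of estimators, we get an estimator $\hat{s}(X)$ satisfying, for all $s \in \overline{\mathbb{L}}_\infty$,

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \Sigma \inf_{m \in \mathcal{M}} \left[ \|\bar{s}_m - s\|^2 + n^{-1} \|s\|_\infty (D_m \vee \Delta_m \vee \log \|s\|_\infty) \right]. \]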


With this bound at hand, we can now return to the problem we considered in Section 1.1,
starting with an arbitrary countable family $\{\mathcal{I}_m, m \in \mathcal{M}\}$ of finite partitions
of $\mathcal{X}$ and weights $\Delta_m$ satisfying (3.10). To each partition $\mathcal{I}_m$ we associate the linear
space $S_m$ of piecewise constant functions of the form $\sum_{I \in \mathcal{I}_m} \beta_I \mathbf{1}_I$. The dimension
of this linear space is the cardinality of $\mathcal{I}_m$ and its metric dimension is bounded by
$|\mathcal{I}_m|/2$. If we know that $s \in \mathbb{L}_\infty(\mu)$, we can proceed as we just explained, building the
family of histograms $\hat{s}_{\mathcal{I}_m}(X_1)$ corresponding to our partitions and using Corollary 6
to get

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \Sigma \inf_{m \in \mathcal{M}} \left[ \|\bar{s}_{\mathcal{I}_m} - s\|^2 + n^{-1} \|s\|_\infty (|\mathcal{I}_m| \vee \Delta_m \vee \log \|s\|_\infty) \right], \tag{6.2} \]

which should be compared with (1.3). Apart from the unavoidable complexity term
$\Delta_m$ due to model selection, we have only lost (up to the universal constant $C$) the
replacement of $|\mathcal{I}_m|$ by $|\mathcal{I}_m| \vee \log \|s\|_\infty$. Examples of families of partitions that satisfy
(3.10) are given in Section 9 of Birgé (2006a).

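In the general case of $s \in \overline{\mathbb{L}}_2(\mu)$, we may apply Theorem 6 to the family of linear
models $\{S_m, m \in \mathcal{M}\}$ derived from these partitions, getting an estimator $\hat{s}$ with a
risk satisfying

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \Sigma \inf_{z \ge 2} \left\{ \inf_{m \in \mathcal{M}} \left[ \|\bar{s}_{\mathcal{I}_m} - s\|^2 + \frac{z (|\mathcal{I}_m| \vee \Delta_m \vee \log z)}{n} \right] + Q_s(z) \right\}. \]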
6.3 A comparison with Gaussian model selection

A benchmark for model selection in general is the particular (simpler) situation of
model selection for the so-called white noise framework in which we observe a Gaussian
process $X = \{X_z, z \in [0, 1]\}$ with $X_z = \int_0^z s(x)\,dx + \sigma W_z$, where $s$ is an unknown
element of $\mathbb{L}_2([0, 1], dx)$, $\sigma > 0$ a known parameter and $W_z$ a Wiener process. For
such a problem, an analogue of Theorem 1 has been proved in Birgé (2006a), namely

Theorem 7 Let $X$ be the Gaussian process given by

\[ X_z = \int_0^z s(x)\,dx + n^{-1/2} W_z, \quad 0 \le z \le 1, \]

where $s$ is an unknown element of $\mathbb{L}_2([0, 1], dx)$ to be estimated and $W_z$ a Wiener
process. Let $\{S_m, m \in \mathcal{M}\}$ be a countable collection of models in $\mathbb{L}_2(\mu)$ with metric
dimensions bounded respectively by $D_m \ge 1/2$. Let $\{\Delta_m, m \in \mathcal{M}\}$ be a family of
weights which satisfy (3.10). There exists an estimator $\hat{s}(X)$ such that, whatever
$s \in \mathbb{L}_2([0, 1], dx)$,

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \inf_{m \in \mathcal{M}} \left[ d_2^2(s, S_m) + n^{-1} (D_m \vee \Delta_m) \right]. \]

Comparing this bound with (5.5) shows that, when $s \in \mathbb{L}_\infty(\mu)$, we get a similar
risk bound for estimating the density $s$ from n i.i.d. random variables, apart from
an additional factor depending on $\|s\|_\infty$. Similar analogies are valid with bounds
obtained for estimating densities with squared Hellinger loss or for estimating the
intensity of a Poisson process as shown in Birgé (2006a and 2007). Therefore, all the
many examples that have been treated in these papers could be transferred to the case
of density estimation with $\mathbb{L}_2$-loss with minor modifications due to the appearance of
$\|s\|_\infty$ in the bounds. We leave all these translations as exercises for the concerned
reader.

6.4 Estimation in Besov spaces

The Besov space $B^\alpha_{p,\infty}([0, 1])$ with $\alpha, p > 0$ is defined in DeVore and Lorentz (1993)
and it is known that a necessary and sufficient condition for $B^\alpha_{p,\infty}([0, 1]) \subset \mathbb{L}_2([0, 1], dx)$
is $\delta = \alpha + 1/2 - 1/p > 0$, which we shall assume in the sequel. The problem of estimating
adaptively densities that belong to some Besov space $B^\alpha_{p,\infty}([0, 1])$ with unknown
values of $\alpha$ and $p$ has been solved for a long time when $\alpha > 1/p$ which is a necessary
and sufficient condition for $B^\alpha_{p,\infty}([0, 1]) \subset \mathbb{L}_\infty([0, 1], dx)$. See for instance Donoho,
Johnstone, Kerkyacharian and Picard (1996), Delyon and Juditsky (1996) or Birgé
and Massart (1997). It can be treated in the usual way (with an estimation of $\|s\|_\infty$)
leading to the minimax rate of convergence $n^{-2\alpha/(2\alpha+1)}$ for the quadratic risk.

6.4.1 Wavelet expansions

It is known from analysis that functions $s \in \mathbb{L}_2([0, 1], dx)$ can be represented by their
expansion with respect to some orthonormal wavelet basis $\{\varphi_{j,k}, j \ge -1, k \in \Lambda(j)\}$
with $|\Lambda(-1)| \le K$ and $2^j \le |\Lambda(j)| \le K 2^j$ for all $j \ge 0$. Such a wavelet basis satisfies

\[ \Big\| \sum_{k \in \Lambda(j)} |\varphi_{j,k}| \Big\|_\infty \le K' 2^{j/2} \quad \text{for } j \ge -1, \tag{6.3} \]

and we can write

\[ s = \sum_{j=-1}^{\infty} \sum_{k \in \Lambda(j)} \beta_{j,k} \varphi_{j,k} \quad \text{with} \quad \beta_{j,k} = \int \varphi_{j,k}(x) s(x) \, dx. \tag{6.4} \]



Moreover, for a convenient choice of the wavelet basis (depending on $\alpha$), the fact that
$s$ belongs to the Besov space $B^\alpha_{p,\infty}([0, 1])$ is equivalent to

\[ \sup_{j \ge 0} 2^{j(\alpha + 1/2 - 1/p)} \Big( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Big)^{1/p} = |s|_{\alpha,p,\infty} < +\infty, \tag{6.5} \]

where $|s|_{\alpha,p,\infty}$ is equivalent to the Besov semi-norm $|s|^\alpha_p$.

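Moreover, it follows from Birgé and Massart (1997 and 2000), as summarized in
Birgé (2006a, Proposition 13), that, given the integer $r$, one can find a wavelet basis
(depending on $r$) and a universal family of linear models $\{S_m, m \in \mathcal{M} = \cup_{J \ge 0} \mathcal{M}_J\}$
with respective dimensions $D_m$, and weights $\{\Delta_m, m \in \mathcal{M}\}$ satisfying (3.10), with the
following properties. Each $S_m$ is the linear span of $\{\varphi_{-1,k}, k \in \Lambda(-1)\} \cup \{\varphi_{j,k}, (j, k) \in m\}$
with $m \subset \cup_{j \ge 0} \Lambda(j)$; $D_m \vee \Delta_m \le c\, 2^J$ for $m \in \mathcal{M}_J$ and

\[ \inf_{m \in \mathcal{M}_J} \inf_{t \in S_m} \|s - t\| \le C(\alpha, p)\, 2^{-J\alpha} |s|_{\alpha,p,\infty} \quad \text{for } s \in B^\alpha_{p,\infty}([0, 1]), \ \alpha < r. \tag{6.6} \]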
6.4.2 The bounded case 

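Actually, only the assumption that $s \in B^\alpha_{p,\infty}([0, 1]) \cap \mathbb{L}_\infty(\mu)$, rather than $\alpha > 1/p$,
is needed to get the optimal rate of convergence $n^{-2\alpha/(2\alpha+1)}$. Indeed, we may apply
the results of Section 6.2 to the family of models which satisfies (6.6) and derive an
estimator $\hat{s}$ with a risk bounded by

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C(\alpha, p) \inf_{J \ge 0} \left[ 2^{-2J\alpha} \left( |s|_{\alpha,p,\infty} \right)^2 + n^{-1} \|s\|_\infty \left( 2^J \vee \log \|s\|_\infty \right) \right]. \]

Choosing $2^J$ of the order of $n^{1/(2\alpha+1)}$ leads to the bound

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C\left(\alpha, p, |s|_{\alpha,p,\infty}, \|s\|_\infty\right) n^{-2\alpha/(2\alpha+1)}, \]

which is valid for all $s \in B^\alpha_{p,\infty}([0, 1]) \cap \mathbb{L}_\infty(\mu)$, whatever $\alpha < r$ and $p$ and although $\alpha$,
$p$, $|s|^\alpha_p$ and $\|s\|_\infty$ are unknown.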

6.4.3 Further upper bounds for the risk 

When $\alpha < 1/p$, i.e. $0 < \delta < 1/2$, $s$ may be unbounded and the classical theory does
not apply any more. As a consequence the minimax risk over balls in $B^\alpha_{p,\infty}([0, 1])$ is
presently unknown. Our study will not, unfortunately, solve this problem but, at least,
provide some partial information. In this section we assume that $\alpha < 1/p$ and, as
usual, restrict ourselves to the case $p \le 2$ so that $\delta \le \alpha$. We consider the wavelet
expansion of $s$ which has been described in Section 6.4.1 and, to avoid unnecessary
complications, we also assume that $|s|_{\alpha,p,\infty} \ge 1$. In what follows, the generic constant
$C$ (changing from line to line) depends on the choice of the basis and $\delta$.
Since $p \le 2$,

\[ \Big( \sum_{k \in \Lambda(j)} \beta_{j,k}^2 \Big)^{1/2} \le \Big( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Big)^{1/p} \le |s|_{\alpha,p,\infty}\, 2^{-j(\alpha + 1/2 - 1/p)}, \]

hence, for $q \ge -1$ an integer,

\[ \sum_{j > q} \sum_{k \in \Lambda(j)} \beta_{j,k}^2 \le C\, 2^{-2q\delta} |s|^2_{\alpha,p,\infty}. \tag{6.7} \]

The simplest estimators of $s$ are the projection estimators $\hat{s}_q$ over the linear spaces
$S'_q$ where $S'_q$ is spanned by $\{\varphi_{j,k}, -1 \le j \le q, k \in \Lambda(j)\}$:

\[ \hat{s}_q(X) = \sum_{j=-1}^{q} \sum_{k \in \Lambda(j)} \hat{\beta}_{j,k}(X)\, \varphi_{j,k}, \quad \text{with} \quad \hat{\beta}_{j,k}(X) = n^{-1} \sum_{i=1}^{n} \varphi_{j,k}(X_i). \]

The risk of these estimators can be bounded using (2.6) and (6.7) by

\[ \mathbb{E}_s\left[\|s - \hat{s}_q(X)\|^2\right] \le d_2^2(s, S'_q) + C 2^q / n \le C \left[ 2^{-2q\delta} |s|^2_{\alpha,p,\infty} + 2^q / n \right]. \]

A convenient choice of $q$, depending on $\delta$, then leads to

\[ \mathbb{E}_s\left[\|s - \hat{s}_q(X)\|^2\right] \le C \left( |s|_{\alpha,p,\infty} \right)^{2/(2\delta+1)} n^{-2\delta/(2\delta+1)}. \]

One can actually choose $q$ from the data using a penalized least squares estimator
and get a similar risk bound without knowing $\delta$ as shown by Theorem 7.5 of Massart
(2007). This is the only adaptation result we know for the case $\alpha < 1/p$ without the
restriction $s \in \mathbb{L}_\infty([0, 1])$.
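For the reader's convenience, the choice of $q$ for the projection estimators can be made explicit by balancing the two terms of the risk bound. The computation below is a sketch only: the shorthand $A = |s|_{\alpha,p,\infty}$ is introduced here for readability, and $q$ is treated as a real number, the rounding to an integer costing at most a constant factor.

```latex
\[
2^{-2q\delta} A^2 = \frac{2^q}{n}
\iff 2^{q(2\delta+1)} = nA^2
\iff 2^q = \left(nA^2\right)^{1/(2\delta+1)},
\]
\[
\text{so that}\quad
2^{-2q\delta} A^2 + \frac{2^q}{n}
= \frac{2 \cdot 2^q}{n}
= 2\, A^{2/(2\delta+1)}\, n^{-2\delta/(2\delta+1)} .
\]
```

The resulting exponent of $n$, namely $2\delta/(2\delta+1)$, is the one attached to the projection and penalized least squares estimators when they are compared with other rates.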

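Let us now see what our method can do. Since $s$ is a density, it follows from (6.4)
and (6.3) that $|\beta_{-1,k}| \le \|\varphi_{-1,k}\|_\infty \le K'/\sqrt{2}$, hence

\[ \Big\| \sum_{k \in \Lambda(-1)} \beta_{-1,k} \varphi_{-1,k} \Big\|_\infty \le \frac{K'}{\sqrt{2}} \Big\| \sum_{k \in \Lambda(-1)} |\varphi_{-1,k}| \Big\|_\infty \le K'^2/2. \]

Moreover, for $j \ge 0$, (6.5) implies that $\sup_{k \in \Lambda(j)} |\beta_{j,k}| \le 2^{-j\delta} |s|_{\alpha,p,\infty}$. Therefore, by
(6.3),

\[ \Big\| \sum_{k \in \Lambda(j)} \beta_{j,k} \varphi_{j,k} \Big\|_\infty \le K' 2^{-j(\alpha - 1/p)} |s|_{\alpha,p,\infty} \]

and, for $J \ge 0$,

\[ \Big\| \sum_{j=0}^{J} \sum_{k \in \Lambda(j)} \beta_{j,k} \varphi_{j,k} \Big\|_\infty \le \begin{cases} C |s|_{\alpha,p,\infty} & \text{if } \alpha > 1/p; \\ C (J + 1) |s|_{\alpha,p,\infty} & \text{if } \alpha = 1/p; \\ C 2^{J(1/p - \alpha)} |s|_{\alpha,p,\infty} & \text{if } \alpha < 1/p. \end{cases} \]

Finally,

\[ \Big\| \sum_{j=-1}^{J} \sum_{k \in \Lambda(j)} \beta_{j,k} \varphi_{j,k} \Big\|_\infty \le C_0 L_J |s|_{\alpha,p,\infty} \quad \text{with} \quad L_J = \begin{cases} 1 & \text{if } \alpha > 1/p; \\ J + 1 & \text{if } \alpha = 1/p; \\ 2^{J(1/p - \alpha)} & \text{if } \alpha < 1/p. \end{cases} \]

Observing that if $s = u + v$ with $\|u\|_\infty \le z$, then $Q_s(z) \le \|v\|^2$, we can conclude from
(6.7) that

\[ Q_s\left( C_0 L_J |s|_{\alpha,p,\infty} \right) \le C\, 2^{-2J\delta} |s|^2_{\alpha,p,\infty}. \]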
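Let us now turn back to the family of linear models described in Section 6.4.1 that
satisfy (6.6). Theorem 6 asserts the existence of an estimator $\hat{s}(X)$ based on this
family of models and satisfying

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \inf_{z \ge 2} \inf_{m \in \mathcal{M}} \left[ d_2^2(s, S_m) + \frac{z (D_m \vee \Delta_m \vee \log z)}{n} + Q_s(z) \right]. \]

Given the integers $J, J'$, we may set $z = z_{J'} = C_0 L_{J'} |s|_{\alpha,p,\infty}$ and restrict the
minimization to $m \in \mathcal{M}_J$ which leads to

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C \left[ \left( 2^{-2J\alpha} + 2^{-2J'\delta} \right) |s|^2_{\alpha,p,\infty} + n^{-1} L_{J'} |s|_{\alpha,p,\infty} \left( 2^J \vee \log z_{J'} \right) \right]. \]

Since $L_{J'} (2^J \vee \log z_{J'})$ is a nondecreasing function of both $J$ and $J'$, this last bound
is optimized when $J\alpha$ and $J'\delta$ are approximately equal which leads to choosing the
integer $J'$ so that $J\alpha/\delta \le J' < J\alpha/\delta + 1$, hence $2^{-2J'\delta} \le 2^{-2J\alpha}$. Assuming, moreover,
that $2^J \ge \log |s|_{\alpha,p,\infty}$, which implies that $2^J \ge C' \log z_{J'}$, we get

\[ \mathbb{E}_s\left[\|s - \hat{s}(X)\|^2\right] \le C |s|^2_{\alpha,p,\infty} \left[ 2^{-2J\alpha} + 2^J L_{J'} \left( n |s|_{\alpha,p,\infty} \right)^{-1} \right]. \]

We finally fix $J$ so that $2^J \ge G > 2^{J-1}$, where $G$ is defined below. This choice ensures
that $G \ge \log |s|_{\alpha,p,\infty}$ for $n$ large enough (depending on $|s|_{\alpha,p,\infty}$) which we assume
here.

- If $\alpha > 1/p$ we set $G = (n |s|_{\alpha,p,\infty})^{1/(2\alpha+1)}$ which leads to a risk bound of the form
$C n^{-2\alpha/(2\alpha+1)} \left( |s|_{\alpha,p,\infty} \right)^{(2\alpha+2)/(2\alpha+1)}$.

- If $\alpha = 1/p$, $L_{J'} \le J\alpha/\delta + 2$ and we take $G = (n |s|_{\alpha,p,\infty} / \log n)^{1/(2\alpha+1)}$ which leads
to the risk bound $C (n/\log n)^{-2\alpha/(2\alpha+1)} \left( |s|_{\alpha,p,\infty} \right)^{(2\alpha+2)/(2\alpha+1)}$.

- Finally, for $\alpha < 1/p$, $L_{J'} \le \sqrt{2}\, 2^{(J\alpha/\delta)(1/p - \alpha)}$ and we set $G = (n |s|_{\alpha,p,\infty})^{1/[\alpha+1+\alpha/(2\delta)]}$
which leads to the bound $C n^{-2\alpha/[\alpha+1+\alpha/(2\delta)]} \left( |s|_{\alpha,p,\infty} \right)^{(2+\alpha/\delta)/[\alpha+1+\alpha/(2\delta)]}$.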
6.4.4 Some lower bounds

Lower bounds of the form $n^{-2\alpha/(1+2\alpha)}$ for the minimax risk on Besov balls are well-known
(deriving from lower bounds for Hölder spaces) and they are sharp for $\alpha > 1/p$,
as shown in Donoho, Johnstone, Kerkyacharian and Picard (1996). To derive
new lower bounds for the case $\alpha < 1/p$ we introduce some probability density $f \in B^\alpha_{p,\infty}([0, 1])$
with compact support included in $(0, 1)$ and Besov semi-norm $|f|^\alpha_p$. Then
we set $g(x) = a f(2anx)$ for some $a > (2n)^{-1}$ to be fixed later. Then $g(x) = 0$ for
$x \notin (0, (2an)^{-1})$,

\[ \int g(x)\,dx = (2n)^{-1}, \quad \|g\| = a (2an)^{-1/2} \|f\| \quad \text{and} \quad |g|^\alpha_p = a (2an)^{\alpha - 1/p} |f|^\alpha_p. \]

Let us now set $t = g + [1 - (2n)^{-1}] \mathbf{1}_{[0,1]}$, so that $t$ is a density belonging to $B^\alpha_{p,\infty}([0, 1])$
with Besov semi-norm

\[ |t|^\alpha_p = |g|^\alpha_p = K a^{1 + \alpha - 1/p} n^{\alpha - 1/p} \quad \text{with} \quad K = 2^{\alpha - 1/p} |f|^\alpha_p. \]

For a given value of the constant $K' > 0$, the choice $a = [K' n^{1/p - \alpha}]^{1/(1 + \alpha - 1/p)} > (2n)^{-1}$
(at least for $n$ large) leads to $|t|^\alpha_p = K K'$ so that $K'$ determines $|t|^\alpha_p$. We also
consider the density $u(x) = t(1 - x)$ which has the same Besov semi-norm. Then

\[ h^2(t, u) = \int \left( \sqrt{g + [1 - (2n)^{-1}]} - \sqrt{1 - (2n)^{-1}} \right)^2 dx \le \int g \, dx = (2n)^{-1}, \]

and it follows from Le Cam (1973) that any estimator $\hat{s}$ based on n i.i.d. observations
satisfies

\[ \max\left\{ \mathbb{E}_t\left[\|t - \hat{s}\|^2\right], \mathbb{E}_u\left[\|u - \hat{s}\|^2\right] \right\} \ge C \|t - u\|^2 = 2C \|g\|^2 = C a n^{-1} \|f\|^2. \]

Since $a n^{-1} = K'^{1/(\delta + 1/2)} n^{-2\delta/(\delta + 1/2)}$, we finally get

\[ \max\left\{ \mathbb{E}_t\left[\|t - \hat{s}\|^2\right], \mathbb{E}_u\left[\|u - \hat{s}\|^2\right] \right\} \ge C \left( |t|^\alpha_p \right)^{2/(2\delta+1)} n^{-4\delta/(2\delta+1)}, \]

where $C$ depends on $K'$, $\|f\|$, $|f|^\alpha_p$ and $\delta$. One can check that this rate is slower than
$n^{-2\alpha/(1+2\alpha)}$ when $0 < \delta < \alpha [2(\alpha + 1)]^{-1}$ or, equivalently, when $\alpha + [2(\alpha + 1)]^{-1} < 1/p$.



6.4.5 Conclusion 

In the case $\alpha > 1/p$, the estimator that we built in Section 6.4.3 has the usual rate
of convergence with respect to $n$, namely $n^{-2\alpha/(2\alpha+1)}$, which is known to be optimal,
and we can extend the result to the borderline case $\alpha = 1/p$ with only a logarithmic
loss. The situation is different when $\alpha < 1/p$ for which, to our knowledge, the value
of the minimax risk is still unknown. The rate $n^{-2\alpha/[\alpha+1+\alpha/(2\delta)]}$ that we get is worse
than the one valid for $\alpha > 1/p$ and also than the lower bound $n^{-4\delta/(2\delta+1)}$ that we
derived in the previous section. It can be compared with the risk of the penalized least
squares estimators based on the nested models $S'_q$, which is, as we have seen, bounded
by $C n^{-2\delta/(2\delta+1)}$. Our rate is better when $\alpha > 2\delta/(2\delta + 1)$, which is always true for
$\alpha \ge 1/2$ since $\delta < 1/2$. When $\alpha < 1/2$, hence $p > 2/(2\alpha + 1) > 1$, it also holds for
$p < 2(1 - \alpha)/(1 - 2\alpha^2) < \alpha^{-1}$. We are convinced that our rate is always suboptimal
in the range $\alpha < 1/p$ but are presently unable to derive the correct minimax rate.
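The comparisons between rates stated above can be checked on examples with exact rational arithmetic. The sketch below is an illustration only: the function name `rate_exponents`, the dictionary keys and the sample values of $(\alpha, p)$ are assumptions of this example; each entry is the exponent $e$ of a rate $n^{-e}$, so that a larger value means faster convergence.

```python
from fractions import Fraction


def rate_exponents(alpha, p):
    """Exponents e of the rates n^(-e) discussed for the case alpha < 1/p.

    alpha and p must be exact (Fraction, int, or strings such as "3/5").
    Requires delta = alpha + 1/2 - 1/p > 0 and alpha < 1/p.
    """
    alpha, p = Fraction(alpha), Fraction(p)
    delta = alpha + Fraction(1, 2) - 1 / p
    if delta <= 0 or alpha >= 1 / p:
        raise ValueError("need 0 < delta and alpha < 1/p")
    return {
        "delta": delta,
        # upper bound obtained from Theorem 6
        "t_estimator": 2 * alpha / (alpha + 1 + alpha / (2 * delta)),
        # upper bound for projection / penalized least squares estimators
        "penalized_ls": 2 * delta / (2 * delta + 1),
        # two-point lower bound
        "lower_bound": 4 * delta / (2 * delta + 1),
        # classical rate valid when alpha > 1/p
        "classical": 2 * alpha / (2 * alpha + 1),
    }


# alpha above the threshold 2 delta / (2 delta + 1): T-estimator rate is better.
favourable = rate_exponents("3/5", 1)
# alpha below the threshold: penalized least squares rate is better.
unfavourable = rate_exponents("1/4", "19/10")
```

For $\alpha = 3/5$, $p = 1$ one gets $\delta = 1/10$ and the exponents $6/23$, $1/6$ and $1/3$ for the T-estimator, the penalized least squares estimator and the lower bound respectively, in agreement with the ordering described in the text.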
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7 Proofs 

7.1 Proof of Proposition [2] 

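For simplicity, we shall write $h(\theta,\lambda)$ for $h(s_\theta,s_\lambda)$ and analogously $d_2(\theta,\lambda)$ for $d_2(s_\theta,s_\lambda)$. Let us first evaluate $h^2(\theta,\lambda)$ for $0 < \theta < \lambda \le 1/3$. Setting $\beta_\theta = \left[\theta^2+\theta+1\right]^{-1} \in [9/13,1)$, we get

\[
\begin{aligned}
2h^2(\theta,\lambda) &= \int \left(\sqrt{s_\theta(x)} - \sqrt{s_\lambda(x)}\right)^2 dx \\
&= \theta^3\left(\theta^{-1}-\lambda^{-1}\right)^2 + \left(\lambda^3-\theta^3\right)\left(\lambda^{-1}-\sqrt{\beta_\theta}\right)^2 + \left(1-\lambda^3\right)\left(\sqrt{\beta_\theta}-\sqrt{\beta_\lambda}\right)^2 \\
&= (\lambda-\theta)\frac{\theta}{\lambda}\left[1-\frac{\theta}{\lambda}\right] + (\lambda-\theta)\left[1+\frac{\theta}{\lambda}+\frac{\theta^2}{\lambda^2}\right]\left(1-\lambda\sqrt{\beta_\theta}\right)^2 + \left(1-\lambda^3\right)\left(\sqrt{\beta_\theta}-\sqrt{\beta_\lambda}\right)^2.
\end{aligned}
\]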



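Note that the monotonicity of $\theta \mapsto \beta_\theta$ implies that

\[ 4/9 < \left(1-\lambda\sqrt{\beta_\theta}\right)^2 < 1, \qquad \sqrt{\beta_\theta}+\sqrt{\beta_\lambda} \ge 2\sqrt{\beta_{1/3}} = 6/\sqrt{13} \]

and

\[ 0 < \beta_\theta - \beta_\lambda = \frac{(\lambda-\theta)(\lambda+\theta+1)}{\left(\theta^2+\theta+1\right)\left(\lambda^2+\lambda+1\right)} < \lambda-\theta. \]

It follows that

\[
\begin{aligned}
0 &< \left(\sqrt{\beta_\theta}-\sqrt{\beta_\lambda}\right)^2 < \frac{13}{36}(\lambda-\theta)^2 = \frac{13\lambda}{36}(\lambda-\theta)\left(1-\frac{\theta}{\lambda}\right) \\
\text{and}\quad 0 &< \left(1-\lambda^3\right)\left(\sqrt{\beta_\theta}-\sqrt{\beta_\lambda}\right)^2 < \frac{13\lambda\left(1-\lambda^3\right)}{36}(\lambda-\theta)\left(1-\frac{\theta}{\lambda}\right) < \frac{2(\lambda-\theta)}{17}\left(1-\frac{\theta}{\lambda}\right).
\end{aligned}
\tag{7.1}
\]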

We can therefore write

\[ G = 2(\lambda-\theta)^{-1}h^2(\theta,\lambda) = z(1-z) + c_1(\theta,\lambda)\left(1+z+z^2\right) + c_2(\theta,\lambda)(1-z), \]

with $z = \theta/\lambda \in (0,1)$, $4/9 < c_1(\theta,\lambda) < 1$ and $0 < c_2(\theta,\lambda) < 2/17$. Since, for given values of $c_1$ and $c_2$, the right-hand side is increasing with respect to $z$, $4/9 < c_1 < G < 3c_1 < 3$ and we conclude that for all $\theta$ and $\lambda$ in $(0,1/3]$,

\[ h^2(\theta,\lambda) = C(\theta,\lambda)|\theta-\lambda| \quad\text{with}\quad 2/9 < C(\theta,\lambda) < 3/2. \]

It immediately follows that the set $S_\eta = \{s_{\lambda_j}, j \ge 0\}$ with $\lambda_j = (2j+1)2\eta^2/3$ is an $\eta$-net for the family $S$. On the other hand, given $\lambda \in (0,1/3)$ and $r \ge 2\eta$, in order that $s_{\lambda_j} \in B(s_\lambda,r)$, it is required that $h^2(\lambda_j,\lambda) = C(\lambda_j,\lambda)|\lambda_j-\lambda| \le r^2$, which implies that $|\lambda_j-\lambda| < (9/2)r^2$ and therefore

\[ |S_\eta \cap B(s_\lambda,r)| \le 1 + (27/4)(r/\eta)^2 \le \exp\left[0.84(r/\eta)^2\right] \quad\text{for all } s_\lambda \in S. \]

It follows from Lemma 2 of Birgé (2006a) that $S$ has a metric dimension bounded by 3.4 and Corollary 3 of Birgé (2006a) implies that a suitable T-estimator $\hat s$ built on $S_\eta$ has a risk satisfying

\[ \mathbb{E}_{s_\theta}\left[h^2(\hat s, s_\theta)\right] \le Cn^{-1} \quad\text{for all } s_\theta \in S. \]
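
As a sanity check on the constants of the Hellinger part of the proof, the following Python sketch evaluates $h^2(\theta,\lambda)$ from the three-term expression of $2h^2(\theta,\lambda)$ and inspects the ratio $h^2(\theta,\lambda)/|\theta-\lambda|$ on a grid of $(0,1/3]$. It also checks the counting inequality $1 + (27/4)x \le \exp(0.84x)$ for $x = (r/\eta)^2 \ge 4$ on a finite range. The function names, the grid and the range are illustrative assumptions.

```python
import math


def beta(theta):
    # beta_theta = 1 / (theta^2 + theta + 1)
    return 1.0 / (theta ** 2 + theta + 1.0)


def hellinger_sq(theta, lam):
    """h^2(theta, lambda) from the three-term expression of 2h^2, for 0 < theta < lam <= 1/3."""
    rb_t, rb_l = math.sqrt(beta(theta)), math.sqrt(beta(lam))
    two_h2 = (theta ** 3 * (1 / theta - 1 / lam) ** 2
              + (lam ** 3 - theta ** 3) * (1 / lam - rb_t) ** 2
              + (1 - lam ** 3) * (rb_t - rb_l) ** 2)
    return two_h2 / 2


def ratio_range(steps=120):
    """Smallest and largest value of h^2 / |theta - lambda| over a regular grid of (0, 1/3]."""
    grid = [k / (3.0 * steps) for k in range(1, steps + 1)]
    ratios = [hellinger_sq(t, l) / (l - t) for t in grid for l in grid if t < l]
    return min(ratios), max(ratios)


def count_bound_holds(x_max=400.0, step=0.25):
    """Check 1 + (27/4) x <= exp(0.84 x) for x = (r/eta)^2 in [4, x_max]."""
    x = 4.0
    while x <= x_max:
        if 1 + 6.75 * x > math.exp(0.84 * x):
            return False
        x += step
    return True
```

On any such grid, the two values returned by `ratio_range` should lie strictly between 2/9 and 3/2 if the bounds on $C(\theta,\lambda)$ hold.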
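Let us now proceed with the $L_2$-distance $d_2$. Since

\[ d_2^2(\theta,\lambda) = \theta^3\left(\theta^{-2}-\lambda^{-2}\right)^2 + \left(\lambda^3-\theta^3\right)\left(\lambda^{-2}-\beta_\theta\right)^2 + \left(1-\lambda^3\right)\left(\beta_\theta-\beta_\lambda\right)^2, \]

we conclude that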



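\[ G = \left(\theta^{-1}-\lambda^{-1}\right)^{-1} d_2^2(\theta,\lambda) = (1-z)(1+z)^2 + c_1(\theta,\lambda)\left(z+z^2+z^3\right) + c_2(\theta,\lambda)(1-z), \]

with $z = \theta/\lambda \in (0,1)$, $8/9 < c_1(\theta,\lambda) < 1$ and $0 < c_2(\theta,\lambda) < 1/27$. It follows that

\[ 1 < 1 + z - z^2 - z^3 + (8/9)\left(z+z^2+z^3\right) < G < 1 + 2z + (1/27)(1-z) < 3, \]

which finally implies that, for all $\theta$ and $\lambda$ in $(0,1/3]$,

\[ d_2^2(\theta,\lambda) = C(\theta,\lambda)\left|\theta^{-1}-\lambda^{-1}\right| \quad\text{with}\quad 1 < C(\theta,\lambda) < 3. \]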

Now setting $S_\eta = \{s_{\lambda_j}, j \ge 0\}$ with $\lambda_j = \left(3+2j\eta^2/3\right)^{-1}$ we deduce as before that $S_\eta$ is an $\eta$-net for $S$. In order that $s_{\lambda_j} \in B(s_\lambda,x\eta)$, it is required that $d_2^2(\lambda_j,\lambda) = C(\lambda_j,\lambda)\left|\lambda_j^{-1}-\lambda^{-1}\right| \le x^2\eta^2$, which implies that $\left|\lambda_j^{-1}-\lambda^{-1}\right| < x^2\eta^2$. It follows that the number of elements of $S_\eta$ contained in the ball is bounded by $3x^2/2+1 \le \exp\left(x^2/2\right)$ for $x \ge 2$. Hence the metric dimension of $S$ with respect to the $L_2$-distance is bounded by 2. It nevertheless follows from the fact that $h(\theta,\lambda) \to 0$ while $d_2(\theta,\lambda) \to +\infty$ when $\theta$ and $\lambda$ tend to zero and classical arguments of Le Cam (1973) (see also Donoho and Liu (1987) or Yu (1997)) that the minimax risk over $S$ is infinite when we use the $L_2$-loss.
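
The contrast between the two distances can be illustrated numerically. The Python sketch below evaluates $d_2^2(\theta,\lambda)$ and $h^2(\theta,\lambda)$ from their three-term expressions, checks the ratio $d_2^2(\theta,\lambda)/|\theta^{-1}-\lambda^{-1}|$ on a grid, and lets one look at a pair of small parameters for which $h$ is small while $d_2$ is large. The function names and the grid are illustrative assumptions.

```python
import math


def beta(theta):
    # beta_theta = 1 / (theta^2 + theta + 1)
    return 1.0 / (theta ** 2 + theta + 1.0)


def d2_sq(theta, lam):
    """Squared L2-distance between s_theta and s_lambda, for 0 < theta < lam <= 1/3."""
    bt, bl = beta(theta), beta(lam)
    return (theta ** 3 * (theta ** -2 - lam ** -2) ** 2
            + (lam ** 3 - theta ** 3) * (lam ** -2 - bt) ** 2
            + (1 - lam ** 3) * (bt - bl) ** 2)


def hellinger_sq(theta, lam):
    """Squared Hellinger distance, with the normalisation 2h^2 = integral of (sqrt - sqrt)^2."""
    rb_t, rb_l = math.sqrt(beta(theta)), math.sqrt(beta(lam))
    return 0.5 * (theta ** 3 * (1 / theta - 1 / lam) ** 2
                  + (lam ** 3 - theta ** 3) * (1 / lam - rb_t) ** 2
                  + (1 - lam ** 3) * (rb_t - rb_l) ** 2)


def d2_ratio_range(steps=120):
    """Smallest and largest value of d2^2 / |1/theta - 1/lambda| over a grid of (0, 1/3]."""
    grid = [k / (3.0 * steps) for k in range(1, steps + 1)]
    ratios = [d2_sq(t, l) / (1 / t - 1 / l) for t in grid for l in grid if t < l]
    return min(ratios), max(ratios)
```

For instance, comparing `hellinger_sq(5e-4, 1e-3)` with `d2_sq(5e-4, 1e-3)` shows two densities that are close in Hellinger distance and far apart in $L_2$-distance, which is the phenomenon behind the infinite minimax risk.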

7.2 Proof of Lemma 1

Let us begin with a preliminary lemma.

Lemma 2 Let $F$ and $G$ be two disjoint sets with positive measures $\alpha = \mu(F)$ and $\beta = \mu(G)$ and $g \in L_2$ such that $\inf_{x\in F} g(x) > 0$. Set $g_\varepsilon = g + \varepsilon(\alpha 1_G - \beta 1_F)$ for $\varepsilon > 0$. Then $g_\varepsilon$ is a density for $\varepsilon$ small enough and for any $f \in L_2$,

\[ \lim_{\varepsilon\to0} \frac{1}{2\varepsilon}\left[d_2^2(g_\varepsilon,f) - d_2^2(g,f)\right] = \alpha\int_G (g-f)\,d\mu - \beta\int_F (g-f)\,d\mu \tag{7.2} \]

and

\[ \lim_{\varepsilon\to0} \frac{2}{\varepsilon}\left[h^2(g_\varepsilon,f) - h^2(g,f)\right] = \beta\int_F \sqrt{fg^{-1}}\,d\mu - \alpha\int_G \sqrt{fg^{-1}}\,d\mu, \tag{7.3} \]

with the convention that $\int_G \sqrt{fg^{-1}}\,d\mu = +\infty$ if either $\mu(G\cap\{g=0\}\cap\{f>0\}) > 0$ or the integral diverges.

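Proof: Since $\int g_\varepsilon\,d\mu = 1$ and $g_\varepsilon \ge 0$ for $\varepsilon$ small enough, $g_\varepsilon$ is a density. Moreover, setting $k = \alpha 1_G - \beta 1_F$, we get

\[ d_2^2(g_\varepsilon,f) = \int (g+\varepsilon k-f)^2\,d\mu = d_2^2(g,f) + \varepsilon^2\|k\|^2 + 2\varepsilon\int k(g-f)\,d\mu \]

and (7.2) follows. Let $\Delta(\varepsilon) = \varepsilon^{-1}\left[h^2(g_\varepsilon,f) - h^2(g,f)\right]$ and fix $\eta > 0$. Then

\[
\begin{aligned}
\Delta(\varepsilon) &= \varepsilon^{-1}\left[\int \sqrt{gf}\,d\mu - \int \sqrt{(g+\varepsilon k)f}\,d\mu\right] \\
&= \varepsilon^{-1}\left[\int_F \left(\sqrt{gf} - \sqrt{(g-\varepsilon\beta)f}\right) d\mu + \int_G \left(\sqrt{gf} - \sqrt{(g+\varepsilon\alpha)f}\right) d\mu\right] \\
&= \int_F \frac{\beta\sqrt{f}}{\sqrt{g-\varepsilon\beta}+\sqrt{g}}\,d\mu - \int_{G\cap\{g>0\}} \frac{\alpha\sqrt{f}}{\sqrt{g+\varepsilon\alpha}+\sqrt{g}}\,d\mu - \int_{G\cap\{g=0\}\cap\{f>0\}} \sqrt{\alpha f/\varepsilon}\,d\mu.
\end{aligned}
\]

When $\varepsilon$ tends to 0, the first integral converges to $(\beta/2)\int_F \sqrt{fg^{-1}}\,d\mu$ and the second one converges to $(\alpha/2)\int_{G\cap\{g>0\}} \sqrt{fg^{-1}}\,d\mu$, by monotone convergence. The last one converges to $+\infty$ if $\mu(G\cap\{g=0\}\cap\{f>0\}) > 0$ and 0 otherwise, which completes the proof of (7.3). □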
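
The two limits of Lemma 2 can be checked by finite differences on a toy example. The Python sketch below works on a space with four atoms of mass 1/4 each; the particular densities `g` and `f`, the sets `F` and `G` and the step `eps` are arbitrary illustrative choices, and the Hellinger distance is normalised by $2h^2 = \int(\sqrt{g}-\sqrt{f})^2 d\mu$, as in Section 7.1.

```python
import math


def lemma2_check(eps=1e-6):
    """Compare finite-difference quotients with the limits (7.2) and (7.3) on a 4-atom space."""
    w = [0.25, 0.25, 0.25, 0.25]   # masses of the atoms (mu is a probability)
    g = [2.0, 1.0, 0.5, 0.5]       # density w.r.t. mu, bounded away from 0 on F
    f = [0.4, 1.6, 1.2, 0.8]       # another density w.r.t. mu
    F, G = [0], [2, 3]             # disjoint sets of atoms
    alpha = sum(w[i] for i in F)   # alpha = mu(F)
    beta = sum(w[i] for i in G)    # beta = mu(G)

    # k = alpha * 1_G - beta * 1_F and g_eps = g + eps * k
    k = [0.0] * len(w)
    for i in G:
        k[i] = alpha
    for i in F:
        k[i] = -beta
    g_eps = [gi + eps * ki for gi, ki in zip(g, k)]

    def integral(values):
        return sum(v * wi for v, wi in zip(values, w))

    def d2_sq(u, v):
        return integral([(a - b) ** 2 for a, b in zip(u, v)])

    def h_sq(u, v):
        return 0.5 * integral([(math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(u, v)])

    fd_d2 = (d2_sq(g_eps, f) - d2_sq(g, f)) / (2 * eps)
    lim_d2 = (alpha * sum((g[i] - f[i]) * w[i] for i in G)
              - beta * sum((g[i] - f[i]) * w[i] for i in F))
    fd_h = 2 * (h_sq(g_eps, f) - h_sq(g, f)) / eps
    lim_h = (beta * sum(math.sqrt(f[i] / g[i]) * w[i] for i in F)
             - alpha * sum(math.sqrt(f[i] / g[i]) * w[i] for i in G))
    return fd_d2, lim_d2, fd_h, lim_h
```

Each finite-difference quotient should agree with the corresponding limit up to an error of order `eps`.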

If $\|v_1 \vee v_2\|_\infty > 2B$, we may assume, exchanging the roles of $v_1$ and $v_2$ if necessary, that $\mu(A) > 0$ with $A = \{v_1 \ge v_2 \text{ and } v_1 > 2B\}$. Let $C = \{v_1 < B \wedge v_2\}$. If $\mu(C) > 0$, we may apply Lemma 2 with $F = A$, $G = C$, $g = v_1$ and $v_1' = g_\varepsilon$. We first set $f = t$. Since $v_1 - t < B$ on $C$ while $v_1 - t > B$ on $A$, it follows from (7.2) that $d_2(v_1',t) < d_2(v_1,t)$ for $\varepsilon$ small enough. If we now set $f = v_2$ and use (7.3), we see that $h(v_1',v_2) < h(v_1,v_2)$ since $v_2 \le v_1$ on $A$ and $v_2 > v_1$ on $C$. We conclude by setting $v_2' = v_2$. If $\mu(C) = 0$, then $\mu(\{B \le v_1 < v_2\}) + \mu(\{v_2 \le v_1 < B\}) = 1$ and both sets have positive $\mu$-measure since $v_1 \ne v_2$. In this case we set $F = \{B \le v_1 < v_2\}$, $G = \{v_2 < v_1 \wedge u\}$ and $g = v_2$. Then $\mu(F) > 0$ and $\mu(G) > 0$ since $u \le B < v_2$ on $F$ and they are densities. If we use (7.2) with $f = u$, we derive that $d_2(v_2',u) < d_2(v_2,u)$ for $\varepsilon$ small enough and if we use (7.3) with $f = v_1$, we derive that $h(v_2',v_1) < h(v_2,v_1)$, in which case we set $v_1' = v_1$.
7.3 Proof of Theorem 5



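We consider the family of tests $\psi(t_n,t_p,\boldsymbol{X}) = \psi_{t_n,t_p,x}(\boldsymbol{X})$ provided by the assumption with $x = A|p-n|$. Given this family of tests and $S = \{t_i, i \ge 1\}$, we define the random function $\mathcal{D}_{\boldsymbol{X}}$ on $S$ as in Birgé (2006a), i.e. we set $\mathcal{R}_i = \{t_j \in S, j \ne i \mid \psi(t_i,t_j,\boldsymbol{X}) = t_j\}$ and

\[
\mathcal{D}_{\boldsymbol{X}}(t_i) =
\begin{cases}
\sup_{t_j \in \mathcal{R}_i} \{d(t_i,t_j)\} & \text{if } \mathcal{R}_i \ne \emptyset; \\
0 & \text{if } \mathcal{R}_i = \emptyset.
\end{cases}
\tag{7.4}
\]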
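
To make definition (7.4) concrete, here is a minimal Python sketch for a finite family, where the supremum is a maximum. The representation of the tests by a function `winner(i, j)` returning the index of the accepted point, the helper `select` and its `slack` parameter are hypothetical conveniences of ours: `select` merely picks the first index whose value is within `slack` of the minimum, in the spirit of T-estimators, and is not claimed to be the exact selection rule used in this proof.

```python
def d_x(dist, winner):
    """Values D_X(t_i) of (7.4) for a finite family.

    dist[i][j] is d(t_i, t_j); winner(i, j) returns the index (i or j)
    accepted by the test between t_i and t_j.
    """
    n = len(dist)
    values = []
    for i in range(n):
        # R_i: indices j != i such that the test between t_i and t_j chooses t_j.
        r_i = [j for j in range(n) if j != i and winner(i, j) == j]
        # Convention of (7.4): the value is 0 when R_i is empty.
        values.append(max((dist[i][j] for j in r_i), default=0.0))
    return values


def select(values, slack=0.0):
    """Smallest index whose D_X value is within `slack` of the minimum (hypothetical rule)."""
    threshold = min(values) + slack
    for j, v in enumerate(values):
        if v <= threshold:
            return j
```

With points on the real line and a test that always prefers the point closer to some fixed target, `d_x` returns 0 for the closest point, since no test rejects it.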

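Given some $t_i \in S$, we want to bound

\[ \mathbb{P}_s\left[\mathcal{D}_{\boldsymbol{X}}(t_i) > xy_i\right] \quad\text{for } x \ge 1 \text{ and } y_i = 4d(s,t_i) \vee \sqrt{Aa^{-1}i2^i}. \]

Let us define the integer $K$ by $x^2 < 2^K \le 2x^2$. Then

\[ K \ge 1, \qquad a2^{-i-K}(xy_i)^2 \ge a2^{-i-1}y_i^2 \ge Ai/2 \qquad\text{and}\qquad e^{-AK} \le x^{-2A/\log 2}. \tag{7.5} \]

Now, setting $y = xy_i$, observe that

\[ \mathbb{P}_s\left[\mathcal{D}_{\boldsymbol{X}}(t_i) > y\right] = \mathbb{P}_s\left[\exists j \text{ with } d(t_i,t_j) > y \text{ and } \psi(t_i,t_j,\boldsymbol{X}) = t_j\right] \le S_1 + S_2, \]

with

\[ S_1 = \sum_{j<i} 1_{d(t_i,t_j)>y}\, \mathbb{P}_s\left[\psi(t_i,t_j,\boldsymbol{X}) = t_j\right]; \qquad S_2 = \sum_{j>i} 1_{d(t_i,t_j)>y}\, \mathbb{P}_s\left[\psi(t_i,t_j,\boldsymbol{X}) = t_j\right]. \]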

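If $i = 1$, then $S_1 = 0$ and if $i \ge 2$, we can use (5.2) and $y \ge 4d(s,t_i)$ to derive that

\[
\begin{aligned}
S_1 &\le B\sum_{j<i} 1_{d(t_i,t_j)>y} \exp\left[-a2^{-i}d^2(t_i,t_j) + A|i-j|\right] \\
&\le B\exp\left[-a2^{-i}y_i^2x^2 + Ai\right] \sum_{j<i} e^{-Aj} \\
&\le \frac{B}{e^A-1}\exp\left[-Ai\left(x^2-1\right)\right] \le \frac{B}{e^A-1}\exp\left[-A\left(x^2-1\right)\right] \\
&\le B\left(1-e^{-A}\right)^{-1}\exp\left[-Ax^2\right] \le B\left(1-e^{-A}\right)^{-1}x^{-2A/\log 2},
\end{aligned}
\]

where we used (7.5), $i \ge 1$ and $x \ge 1$. Also, by (5.1),

\[
\begin{aligned}
S_2 &\le B\sum_{j>i} 1_{d(t_i,t_j)>y} \exp\left[-a2^{-j}d^2(t_i,t_j) - A|i-j|\right] \\
&\le B\sum_{j>i} \exp\left[-a2^{-j}y^2 - A(j-i)\right] = B\sum_{k=1}^{+\infty} \exp\left[-a2^{-i-k}y^2 - Ak\right] \\
&\le B\left[\sum_{k=1}^{K} \exp\left[-a2^{-i-k}y^2 - Ak\right] + \sum_{k>K} \exp[-Ak]\right] = B(S_3+S_4),
\end{aligned}
\]

with $S_4 = e^{-AK}\left(e^A-1\right)^{-1}$ and, by (7.5),

\[
\begin{aligned}
S_3 &= e^{-AK}\sum_{j=0}^{K-1} \exp\left[-a2^{-i-K+j}y^2 + Aj\right] \le e^{-AK}\sum_{j=0}^{K-1} \exp\left[-A\left(i2^{j-1}-j\right)\right] \\
&\le e^{-AK}\sum_{j\ge0} \exp\left[-\left(2^{j-1}-j\right)\right] < 3e^{-AK}.
\end{aligned}
\]

We finally get, putting all the bounds together and using (7.5) again,

\[ \mathbb{P}_s\left[\mathcal{D}_{\boldsymbol{X}}(t_i) > xy_i\right] \le BC(A)x^{-2A/\log 2} \quad\text{for } x \ge 1. \tag{7.6} \]

As a consequence $\mathcal{D}_{\boldsymbol{X}}(t_i) < +\infty$ a.s. and we can define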

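\[ \hat s_A = t_p \quad\text{with}\quad p = \min\left\{j \,\Big|\, \mathcal{D}_{\boldsymbol{X}}(t_j) \le \inf_{i} \mathcal{D}_{\boldsymbol{X}}(t_i) + \sqrt{Aa^{-1}}\right\}. \]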

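In view of the definition of $\mathcal{D}_{\boldsymbol{X}}$, $d(t_i,t_j) \le \mathcal{D}_{\boldsymbol{X}}(t_i) \vee \mathcal{D}_{\boldsymbol{X}}(t_j)$, hence, for all $t_i \in S$, $d(\hat s_A,t_i) \le \mathcal{D}_{\boldsymbol{X}}(t_i) + \sqrt{Aa^{-1}}$ and

\[ d(\hat s_A,s) \le \mathcal{D}_{\boldsymbol{X}}(t_i) + \sqrt{Aa^{-1}} + d(s,t_i) \le \mathcal{D}_{\boldsymbol{X}}(t_i) + y_i. \]

It then follows from (7.6) that

\[ \mathbb{P}_s\left[d(\hat s_A,s) > zy_i\right] \le BC(A)(z-1)^{-2A/\log 2} \quad\text{for } z \ge 2. \]

Integrating with respect to $z$ leads to

\[ \mathbb{E}_s\left[\left(d(\hat s_A,s)/y_i\right)^q\right] \le BC(A,q) \quad\text{for } 1 \le q < 2A/\log 2, \]

and, since $t_i$ is arbitrary in $S$,

\[ \mathbb{E}_s\left[d^q(\hat s_A,s)\right] \le BC(A,q)\inf_{i\ge1}\left[d^q(s,t_i) \vee \left(a^{-1}i2^i\right)^{q/2}\right] \quad\text{for } 1 \le q < 2A/\log 2. \]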
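
Two numerical facts are used silently in the bounds on $S_1$ and $S_3$: the series $\sum_{j\ge0}\exp[-(2^{j-1}-j)]$ is smaller than 3, and the integer $K$ with $x^2 < 2^K \le 2x^2$ satisfies $K \ge 1$ and $e^{-AK} \le x^{-2A/\log 2}$. The Python sketch below checks both; the truncation of the series at 60 terms and the function names are illustrative choices (the neglected terms are far below machine precision).

```python
import math


def s3_series(terms=60):
    """Partial sum of sum_{j >= 0} exp(-(2^(j-1) - j))."""
    return sum(math.exp(-(2.0 ** (j - 1) - j)) for j in range(terms))


def integer_k(x):
    """Integer K with x^2 < 2^K <= 2 x^2, for x >= 1."""
    k = 1
    while 2 ** k <= x * x:
        k += 1
    return k


def check_k_properties(a_const, x):
    """Check K >= 1, x^2 < 2^K <= 2 x^2 and exp(-A K) <= x^(-2A / log 2)."""
    k = integer_k(x)
    return (k >= 1
            and x * x < 2 ** k <= 2 * x * x
            and math.exp(-a_const * k) <= x ** (-2 * a_const / math.log(2)))


def s1_last_step_holds(a_const, x):
    """Check exp(-A x^2) <= x^(-2A / log 2), the last inequality in the bound on S_1."""
    return math.exp(-a_const * x * x) <= x ** (-2 * a_const / math.log(2))
```

Here `a_const` stands for the constant $A$ of the proof; any value with $A \ge 1$ and any $x \ge 1$ can be tried.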